STOCK PRICE PREDICTION USING SENTIMENT
ANALYSIS
A capstone project report submitted in partial fulfillment of the requirement for the award of the
degree of
Bachelor of Engineering
Electronics and Communication Department
Submitted By
● Team Member 1 Nitish Garg 101915020 ngarg_be19@thapar.edu
● Team Member 2 Mayukh Sharma 101915076 msharma_be19@thapar.edu
● Team Member 3 Shruti Singh 101956003 ssingh11_be19@thapar.edu
● Team Member 4 Riya Bajaj 101915092 rbajaj_be19@thapar.edu
● Team Member 5 Himank Jindal 101915023 hjindal_be19@thapar.edu
Under the supervision of
Dr. Rajesh Khanna (Professor, ECED)
Dr. Surbhi Sharma (Associate Professor, ECED)
Department of Electronics and Communication Engineering
THAPAR INSTITUTE OF ENGINEERING & TECHNOLOGY, PATIALA, PUNJAB
December 2022
DECLARATION
We hereby declare that the capstone project group report titled “Stock price Prediction using
Sentiment Analysis” is an authentic record of our own work carried out at “Thapar Institute of
Engineering and Technology, Patiala” as a capstone project in the seventh semester of B.E.
(Electronics & Communication Engineering), under the guidance of “Dr. Rajesh Khanna” and
“Dr. Surbhi Sharma” from January to December 2022.
Date: Aug 15, 2022
Name Registration Number Signature
Himank Jindal 101915023
Nitish Garg 101915020
Riya Bajaj 101915092
Mayukh Sharma 101915076
Shruti Singh 101956003
ACKNOWLEDGEMENT
We would like to express our thanks to our mentors Dr. Rajesh Khanna and Dr. Surbhi Sharma.
They have been of great help in our project and are an indispensable resource of technical
knowledge. They truly are amazing mentors to have.
We are also thankful to Dr. Alpana Agarwal, Head, Electronics and Communication Engineering
Department, the entire faculty and staff of the Electronics and Communication Engineering
Department, and also our friends, who devoted their valuable time and helped us in all possible
ways towards the successful completion of this project. We thank all those who have contributed
either directly or indirectly towards this project.
Lastly, we would also like to thank our families for their unyielding love and encouragement. They
always wanted the best for us, and we admire their determination and sacrifice.
Date: Aug 15, 2022
Name Registration Number
Himank Jindal 101915023
Nitish Garg 101915020
Mayukh Sharma 101915076
Riya Bajaj 101915092
Shruti Singh 101956003
ABSTRACT
Prediction and analysis of the stock market are some of the most difficult tasks to execute. There
are numerous reasons for this, including market volatility and a variety of other dependent and
independent variables that influence the market value of a particular stock. Because of these
factors, it is extremely difficult for any stock market expert to predict the market's rise and fall
with great accuracy, and thus it becomes challenging for the investor to invest the money to make
profits.
We will develop a machine learning model to forecast stock prices using the Long Short-Term
Memory (LSTM) approach. Long Short-Term Memory is an artificial recurrent neural network
(RNN) architecture used in deep learning. Unlike standard feed-forward neural networks, an LSTM
has feedback connections, and it makes small modifications to the information in its cell state
through multiplications and additions. It can handle both single data points (such as images) and
entire data sequences (such as speech or video).
Text mining and natural language processing (NLP) are used in sentiment analysis, also known as
opinion mining, to identify and extract subjective material from users' opinions, assessments,
feelings, attitudes, and emotions. The stock price is more likely to rise if the news sentiment is
positive; the stock price is more likely to decrease if the news sentiment is negative. The goal of
this research is to create a model that can predict news polarity, which can affect stock market
trends.
TABLE OF CONTENTS
Declaration
Acknowledgment
Abstract
List of figures
Chapter 1: Introduction
1.1 Project overview
1.2 Motivation
1.3 Assumptions and constraints
1.4 Novelty of work
Chapter 2: Literature survey
2.1 Literature survey
2.2 Project timeline
2.3 Problem definition and scope
2.4 Risk analysis
2.5 Approved objective
2.6 Project outcomes and Deliverables
2.7 Risk Analysis
Chapter 3: Flowchart
3.1 Workflow architecture
3.2 Activity diagram
3.3 Tools and technologies used
Chapter 4: Project description
4.1 Libraries and languages used
4.2 Tools
4.3 Procedure
Chapter 5: Implementation and Experimental Results
5.1 Sample Code
5.2 Output and Accuracy
Chapter 6: Outcomes and Prospective learning
6.1 Outcomes
6.2 Future Scopes
6.3 Prospective learning
6.4 Conclusion
Chapter 7: Project timeline
7.1 Gantt chart
7.2 Project timeline
References
LIST OF FIGURES
Figure No. Figure Content
Figure 1 Architectural Growth Sequence
Figure 2 Activity diagram
Figure 3 LSTM diagram
Figure 4 BERT diagram
Figure 5 Gantt Chart
CHAPTER 1: INTRODUCTION
1.1 PROJECT OVERVIEW
For a long time, stock market forecasting has been a hot topic of study. Stock market prices,
according to the Efficient Market Hypothesis (EMH), are mostly driven by fresh information and
follow a random walk pattern. Several people have attempted to extract patterns in the way stock
markets operate and respond to external stimuli, even though this theory is largely acknowledged
by the scientific community as a basic paradigm regulating markets in general. The stock market is
a place where people can make a fortune if they can successfully predict future market movements.
Because the stock market is volatile and exhibits complicated behavior, making decisions is both
difficult and necessary. Investors are always looking for a better technique to forecast future stock
price behavior, which will help them determine the optimal moment to trade stocks in order to
maximize their returns. Studies have shown that the future trend of a company's stock is
heavily influenced by its past performance and that the company's public image
also plays an important role in prediction. For example, negative news about the company can
have a significant impact on the market trend, leading to a downward movement. Before investing,
investors consider the company's past performance as well as the influence of recent news.
1.2 MOTIVATION
The major goal here is to develop a model that estimates future stock market trends with a low
error rate and improved prediction accuracy. Data mining can be used to extract information from
huge and complicated datasets, resulting in superior stock market trend predictions. To forecast
future stock market behavior, we will combine financial news sentiment analysis with attributes
taken from historical stock prices. We'll implement both the sentiment analysis and historical data
models separately, then combine their results to create an effective market prediction model.
1.3 ASSUMPTIONS AND CONSTRAINTS
Since the market today is dynamic, our analysis of the stock market serves only as a reference for
understanding the workings of the market under a limited set of constraints. The basic assumption is
that historical data and sentiment analysis of news will be sufficient to predict market trends.
Enough tweets must be available for the sentiment analysis to be accurate. The previous
month's stock data (open, high, low, close, and volume) must also be available to train the
machine learning model.
1.4 NOVELTY OF WORK
Most of the existing models used for stock price prediction active in markets today simply provide
predictions based on historical data. However, there is a fundamental flaw in this approach. This
approach doesn't take into account real-world events. Stock prices do not depend only on the
previous prices, just as the future is not entirely determined by historical events. News can also
have an impact on stock trends. Stock prices sometimes show sudden
increases or decreases that are associated with real-world events. Therefore, a
more realistic scenario would be that the prices depend on the emotions of the investors and how
the media portrays the company, etc. The concept behind our model stands apart as it focuses not
only on historical data but also on present news using sentiment analysis on Twitter trends.
● Sentiment analysis is the automated process of analyzing text to determine the sentiment it
communicates (positive, negative, or neutral).
● We can take thousands of tweets about a company and evaluate if they are positive or negative
in real-time using sentiment analysis!
● Sentiment analysis is frequently used to predict direct changes in stock prices as a result of
immediate sentiment changes.
CHAPTER 2: LITERATURE SURVEY
2.1 LITERATURE SURVEY
In a recent study by Singh, P. K. et al. [7], sentiment analysis was applied to reviews on the Flipkart
e-commerce website to filter out irrelevant reviews, with MongoDB used as the backend database.
In another study, Gunduz et al. [9] used a Naive Bayes classifier to examine whether the academic
success of Turkish universities is related to the sentiment expressed about those universities on
social media. For this purpose, the top 10 most successful Turkish universities, as ranked by URAP,
were selected. Twitter, which allows users to share tweets with friends or followers, was chosen as
the social media platform for this study. First, tweets were collected via the Twitter REST API and
labeled as positive, negative, or neutral. Pre-processing for feature extraction was done by
extracting meaningful special characters from tweets. The tweets were then converted into word
lists weighted by two measures: term frequency and inverse document frequency. The success rate
of the system was found to be 72.33%. Molla et al. [10] performed sentiment analysis of user
opinions about different Samsung products using the official Twitter accounts of Samsung. Data
visualization tools such as NodeXL were used to render the social network graph of user opinions.
Future work was proposed to focus on the location management of each tweet and the inclusion of
emotions.
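The two weighting measures mentioned above, term frequency and inverse document frequency, can be illustrated with a minimal sketch. The surveyed study does not state which TF-IDF variant it used, so the length-normalized term frequency and natural-log inverse document frequency below are assumptions, and the sample tweets are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, hand-written tweets used only for illustration.
tweets = [
    "great campus great teachers".split(),
    "bad campus food".split(),
    "great research output".split(),
]
weights = tf_idf(tweets)
```

A term that appears in only one tweet ("teachers") receives a higher weight than an equally frequent term shared by several tweets ("campus"), which is what makes the weighting useful as classifier input.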
Lu, Y., and Chen, J. (2012) presented a study for the opinion analysis of microblog content. The
public opinion model was divided into four modules: data collection module, corpus processing
module, sentiment analysis module, and the data management module. For retrieving online
microblog content, crawlers were used, and for classifying microblogs, a text classification method
called support vector machine was used. The result shows that precision classification exceeded
90% with the use of a classifier support vector machine. It was proposed that more work could be
done to improve the performance of the support vector machine. Batool, R. et al. (2013) analyzed
4000 tweets to classify data and sentiment more precisely from Twitter, containing information
such as food, diet, diabetes, education, and movies. First a knowledge generator was used to
classify tweets into different categories, and then a knowledge enhancer with a synonym binder
was applied to increase the information gain. The Knowledge enhancer module adds additional
knowledge that was not extracted by the Alchemy API used in the knowledge generator phase. A
synonym binder was used to bind synonyms with entities and keywords extracted by the
knowledge generator and knowledge enhancer. Results showed that an overall significant
improvement of 0.1% to 55% had been achieved using the said approach. M. Meral & B. Diri [25]
performed sentiment analysis of Turkish tweets across nine domains: insurance, sport,
finance, food, automotive, politics, real estate, telecommunication, and health. The tweets were
classified as neutral, positive, or negative using Naive Bayes, Support Vector Machines, and
Random Forest. The tweets were then grouped by domain: health, politics, finance, and
telecommunication in the negative sentiment category; food, real estate, sports, and automotive in
the neutral category; and the rest in the positive category. From the results obtained, it was
concluded that support vector machines give the best results compared to the other classifiers.
Li, SWang et al. [27] applied sentiment analysis to Twitter data to predict the success of movies.
For this purpose, movies were classified as flop, hit, or average. Tweets from 2009 to 2013 were
extracted, and each tweet was classified as positive, negative, neutral, or irrelevant. A LingPipe
sentiment analyzer was used to test the sentiments, and results showed that the movie prediction
accuracy of the developed system was 64.4% better than the conventional system. In another study,
Wang, X., & Luo [28] predicted movie performance from social networking data using sentiment
analysis. Sentiments were collected from various social media platforms, such as Twitter and
YouTube, and the prediction was done using the K-means clustering algorithm.
2.1.1. THEORY ASSOCIATED WITH THE PROBLEM AREA
Businesses are primarily run on customer satisfaction and customer reviews of their products.
Shifts in sentiment on social media have been shown to correlate with shifts in stock markets.
Identifying customer grievances and resolving them leads to customer satisfaction as well as the
trustworthiness of an organization. Hence, there is a necessity for an unbiased automated system to
classify customer reviews regarding any problem. In today’s environment of data overload (which
does not by itself mean better or deeper insights), companies may have collected mountains of
customer feedback, yet it remains impossible for humans to analyze it manually without error or
bias. Oftentimes, companies with the best intentions find themselves in an insight vacuum: they
know they need insights to inform their decision making and that they lack them, but they do not
know how best to get them.
Sentiment analysis provides some insight into what the most important issues are, from the
perspective of customers, at least. Because sentiment analysis can be automated, decisions can be
made based on a significant amount of data rather than plain intuition. Time series forecasting and
modeling play an important role in data analysis. Time series analysis is a specialized branch of
statistics used extensively in fields such as econometrics and operations research. Time series are
widely used in analytics and data science. Stock prices are volatile in nature and
depend on various factors. The main aim of this project is to predict stock prices using Long
Short-Term Memory (LSTM).
2.1.2. PROBLEMS FACED AND IDEA BEHIND THE APPROACH
The more we collectively understand how Amazon operates, the better we can all make informed
decisions on where to purchase products online. The more you know about the industry, the less
likely you are to eat factory-processed meat products. As such, by understanding the inner details,
we have been able to see the backend of Amazon’s operation, and what we have learned over the
period has been both illuminating and disturbing. This is what you most likely do not know about
Amazon: 68% of the products currently sold on Amazon are sold by third-party sellers
and are not manufactured by Amazon. Amazon only manufactures roughly 30%
of the products that it sells on its site. This practice is called "private labeling." Currently, Amazon
owns over 90 different private label brands (i.e., it manufactures a product and then puts its own
“unique” brand name on it, such as Amazon Basics for tech products or Beauty Bar for cosmetics).
So how does Amazon figure out which products to manufacture and “private label?” How does
Amazon know what the consumer wants and which items will be profitable? The answers are
simple and, unfortunately, criminal. For starters, Amazon has access to all of its Third-Party seller
data. Sellers use Amazon’s Seller Portal to list and sell their products. Amazon collects its 30% fee
from them, they pay for and ship the product to the customer (more on this later), and the world is
happy. But Amazon sees their portal. It knows how many products they sell
every month. It can even calculate margins if you put in enough information into the “calculator”
that Amazon provides free of charge within its seller portal. So, when Amazon “sees” a product
that is selling a certain number of units per month (they have an automatic algorithm that
calculates this), your product is “flagged” by the powers that be at Amazon as a product that needs
to be copied/stolen/knocked off and manufactured by Amazon under a “new” brand name. And
with its powerful algorithms, Amazon can ensure that your product ends up buried at the bottom of
their search results pages while their new, shiny knock-off shows up at the top of the page when
you do your Amazon search! And just like that, Amazon destroys the small business that it has
taken fees from and used over the past few years. For years, Amazon has been stealing ideas from
sellers on its platform by using data that they “claim” was off-limits for them to use. E-commerce
platforms such as Amazon and Flipkart have potentially destroyed the existing ecosystem of Indian
retail through unethical and predatory business practices, and the future seems extremely bleak
and grave for India’s small retailers.
Unfortunately, these two large companies who have almost 80 percent market share of India’s
eCommerce business have given our country the most maligned and vitiated foundation for
Ecommerce business. As a marketplace entity, their prime responsibility was to create a healthy
and thriving technological platform to promote the businesses of small sellers by connecting them
to potential buyers. But in stark contrast, their ulterior and shrouded business motive has been to
ensure that small offline retailers perish and shut business so that they can get a strong foothold in
India’s retail market. It is really painful to note that in the last 12 months more than 50,000 mobile
retailers, 30,000 electronics retailers, about 25,000 Kirana, and 35,000 garment retailers have shut
their business mainly due to these Ecommerce giants who have blatantly violated the Govt’s FDI
policy and indulged in inventory control, predatory pricing, preferential seller treatment, illegal
exclusivity among other violations.
2.1.3 THE PROBLEM IDENTIFIED
In the existing methods, we found that some approaches predict stock prices only based on
historical stock data, which seldom introduces unstructured text data into the financial field.
Although some methods considered the role of non-traditional data, they only investigated
financial news or social media information. To overcome these limitations, our goal is to predict
the prices of five stocks in India’s share market with multiple data sources and calculate the error
of the predicted prices. We first combine historical stock data, technical indicators, stock forum
posts and financial news. Then, we investigate text sentiment analysis based on convolutional
neural networks (CNN) to calculate the investor's sentiment tendency. Finally, we explored the
advantages of long short-term memory (LSTM) for processing time series data for predicting stock
prices. The experimental results show that the proposed method can fit multi-source data well and
achieve low error. Our contributions include three aspects:
● A LSTM framework is designed by incorporating multiple data sources and investors’
sentiment.
● Sentiment analysis method based on CNN is proposed to calculate the investor sentiment
index.
● LSTM network with an attention mechanism is proposed to predict stock prices.
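One way to picture the multi-source input described above is a per-day feature record that carries both price fields and a sentiment index. The following is a minimal sketch under stated assumptions: the field names, the `merge_features` helper, the neutral default of 0.0 for days without text data, and the sample values are all hypothetical and not taken from the project code.

```python
def merge_features(price_rows, sentiment_by_date, default_sentiment=0.0):
    """Attach a sentiment index to each day's price features.

    price_rows: list of dicts with a "date" key plus price fields.
    sentiment_by_date: dict mapping date string -> sentiment index.
    Days without any sentiment data get `default_sentiment` (assumed neutral).
    """
    merged = []
    for row in price_rows:
        features = dict(row)  # copy so the input rows stay unchanged
        features["sentiment"] = sentiment_by_date.get(row["date"], default_sentiment)
        merged.append(features)
    return merged


# Hypothetical sample values for illustration only.
prices = [
    {"date": "2022-08-01", "open": 100.0, "high": 104.0, "low": 99.0,
     "close": 103.0, "volume": 12000},
    {"date": "2022-08-02", "open": 103.0, "high": 105.0, "low": 101.0,
     "close": 102.0, "volume": 9000},
]
sentiment = {"2022-08-01": 0.4}
merged = merge_features(prices, sentiment)
```

Joining on the trading date keeps the two sources aligned, so each row passed to the sequence model describes a single day.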
2.2 RESEARCH GAPS
Using sentiment analysis along with previous data: Sentiment analysis is a particularly
interesting area of natural language processing (NLP) used to assess the language used in a body of
text. Through sentiment analysis, you can take thousands of tweets about a company and
evaluate in real time whether they are generally positive or negative. Many researchers
have found investor sentiment to be an important factor in financial markets. In some cases,
investors tend to buy stocks after good news is announced, which leads to higher stock prices,
and to sell after bad news breaks, which causes the price to fall. Information on the Internet
provides a valuable resource for reflecting investor sentiment. Many researchers now use sentiment
analysis and news analysis to predict stock prices.
Using an LSTM-based model: LSTM neural networks are a derivative of RNNs. They not only
address the long-term memory deficit of RNNs but also mitigate the problem of vanishing
gradients. LSTM neural networks can dynamically learn and decide whether to make a given
output the next recursive input. Because this mechanism can store important information, LSTMs
are a suitable basis for building the predictive models in this study.
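The gating mechanism described above can be made concrete with a single step of an LSTM cell. The sketch below uses the standard forget, input, candidate, and output gates, reduced to scalar inputs and states for readability. The parameter layout and the name `lstm_step` are illustrative assumptions; a real model would use vector states and learned weights from a deep learning framework.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each gate name to a (w_x, w_h, bias) tuple.
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    i = sigmoid(pre_activation("input"))        # how much new content to write
    g = math.tanh(pre_activation("candidate"))  # the new content itself
    o = sigmoid(pre_activation("output"))       # how much memory to expose

    c = f * c_prev + i * g  # additive cell-state update
    h = o * math.tanh(c)    # output that is fed back at the next step
    return h, c


# Hypothetical parameters and inputs for illustration only.
params = {gate: (0.5, 0.5, 0.0)
          for gate in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(x, h, c, params)
```

The cell state is updated by scaling the old value and adding new content rather than by repeatedly squashing it, which is why gradients can survive across many time steps when the forget gate stays close to 1.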
2.3 PROBLEM DEFINITION AND SCOPE
2.3.1 PROBLEM STATEMENT
Stock market prediction relies on factors such as interest rates, economic activity, and related
markets that influence the demand and supply of trading volume. Currently, stockbrokers who
execute trades and advise clients rely on their experience, technical analysis (price trends), or
fundamental analysis in picking their stocks. These current methods are subjective and usually
short-sighted due to their limited capacity to crunch raw numbers. With the value of trade money
involved, improper investment could easily mean great losses for investors, especially if they keep
making wrong decisions. The lack of guaranteed returns has also led to a reluctance by potential
investors to participate in the market. It is therefore desirable to have a model that can guide on the
most likely next day prices (prediction) as a basis for making any investment decision. This study
proposes text mining of financial news and public sentiments and opinions from social media such
as Twitter. The combination of market data and news features helps improve the accuracy of
predictions. Regardless, existing systems have failed to effectively integrate news features with
market data. With this, the results obtained are converted into numeric forms that feed the
prediction process.
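A minimal sketch of how labeled text could be converted into a numeric form is shown below. The mapping of labels to +1, 0, and -1, the per-day averaging, and the name `daily_sentiment` are assumptions made for illustration; the report does not specify the exact encoding.

```python
from collections import defaultdict

# Assumed numeric encoding of the three sentiment classes.
LABEL_TO_SCORE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def daily_sentiment(labeled_items):
    """Average the numeric scores of all items published on each date.

    labeled_items: iterable of (date, label) pairs.
    Returns a dict mapping date -> mean score in [-1.0, 1.0].
    """
    scores = defaultdict(list)
    for date, label in labeled_items:
        scores[date].append(LABEL_TO_SCORE[label])
    return {date: sum(values) / len(values) for date, values in scores.items()}


# Hypothetical labeled tweets and headlines for illustration only.
items = [
    ("2022-08-01", "positive"),
    ("2022-08-01", "positive"),
    ("2022-08-01", "negative"),
    ("2022-08-02", "neutral"),
]
index = daily_sentiment(items)
```

Averaging per day yields one number per trading date, which can then sit alongside the market data as an additional input feature.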
2.3.2 SCOPE AND LIMITATIONS
The project is limited to shares of companies listed on the NSE. Additionally, each company
should have traded for at least five years to ensure data consistency. The languages to be used in
the sentiment analysis process are English and Hindi. Slang and other vernacular
languages will not be considered. This study assumes that neither stockbrokers nor any other
affected parties manipulate the market in a way that would significantly affect stock price
movements.
2.4 INTRODUCTION
2.4.1 PURPOSE
The purpose of this SRS document is to provide a detailed overview of our software product, its
parameters, and goals. This document describes the project’s target audience and its user interface,
hardware, and software requirements. It defines how our audience and team will interact with the
product. This project aims to determine the future movement of the stock value of a financial
exchange. Accurate prediction of share price movement allows investors to make more
profit. Predicting how the stock market will move is one of the most challenging issues due to the
many factors that are involved in stock prediction, such as interest rates, politics, and economic
growth, that make the stock market volatile and very hard to predict accurately. The prediction of
shares offers huge chances for profit and is a major motivation for research in this area; knowledge
of stock movements by a fraction of a second can lead to high profits. Since stock investment is a
major financial market activity, a lack of accurate knowledge and detailed information would lead
to an inevitable loss of investment.
2.4.1.1 INTENDED AUDIENCE AND READING SUGGESTIONS
The intended audience is small-scale investors and people who want to learn about stock market
trends, as well as people who are largely affected by sudden changes in stock prices caused by
manipulation of the market by some famous personalities.
2.4.1.2 PROJECT SCOPE
The scope of our project is to predict the stock market data using different algorithms and study
their prediction efficiency. It is beneficial for companies and individuals to make proper
investment decisions.
2.4.2 OVERALL DESCRIPTION
2.4.2.1 PRODUCT PERSPECTIVE
There are many challenges involved in sentiment analysis. The main problems are: an
inability to perform well across different domains; inadequate accuracy and performance when
labeled data is insufficient; and an inability to deal with complex sentences that require
more than sentiment words and simple analysis. Our approach requires large amounts of labeled
news data for training and correctly predicting news sentiment. This data is, however, not easy to
obtain. Because of this, we are using pre-trained models. BERT can be fine-tuned to perform well
in specialized use cases (like sentiment analysis of news), but its performance depends on the
quality of training data. TextBlob is a library whose sentiment analyzers can be used out of the
box. It provides a consistent API for diving into common natural language processing (NLP) tasks
such as part-of-speech tagging, noun phrase extraction, and sentiment analysis. After testing both
libraries, TextBlob gave better results than BERT. However, if better labeled data is available, it is
recommended to use BERT.
BERT:
BERT stands for Bidirectional Encoder Representations from Transformers. It is a
Transformer-based machine learning technique for natural language processing (NLP) pre-training.
It is designed to pre-train deep bidirectional representations from unlabeled text by jointly
conditioning on both left and right context. As a result, the pre-trained BERT model can be
fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of
NLP tasks.
TextBlob:
TextBlob is a Python library for Natural Language Processing (NLP). TextBlob actively uses
Natural Language ToolKit (NLTK) to achieve its tasks. NLTK is a library that gives easy access to
a lot of lexical resources and allows users to work with categorization, classification, and many
other tasks. TextBlob is a simple library that supports complex analysis and operations on textual
data. The textblob.sentiments module contains two sentiment analysis implementations,
PatternAnalyzer (based on the pattern library) and NaiveBayesAnalyzer (an NLTK classifier
trained on a movie review corpus).
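As a rough illustration of how TextBlob can be used for this task, the sketch below scores a piece of text with the default analyzer and maps the polarity to a label. The `sentiment.polarity` property is part of TextBlob's API and returns a value between -1.0 and 1.0. The helper names and the 0.05 neutrality threshold are assumptions chosen for illustration, not values taken from the project.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity in [-1.0, 1.0] to a sentiment label.

    The threshold is an assumed cut-off for treating text as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def headline_sentiment(text):
    """Score text with TextBlob's default PatternAnalyzer.

    Requires the third-party package: pip install textblob
    """
    from textblob import TextBlob  # imported lazily so the helper above has no dependency

    polarity = TextBlob(text).sentiment.polarity
    return polarity, label_polarity(polarity)
```

Calling `headline_sentiment("Company posts record quarterly profit")` returns the polarity together with its label. The exact score depends on the lexicon shipped with the installed TextBlob version.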
Software Requirements:
● Operating System: Windows 7 and above, a Linux-based OS, or macOS
● Python 3.5 in Google Colab is used for data pre-processing, model training and prediction.
2.4.2.2 User Interfaces
The user interface (UI) is the point of human-computer interaction and communication in a device.
So, when the user opens Google Colab, they may choose the date from which
sentiment analysis is to be performed and the stock for which it is to be performed.
2.4.2.3 Hardware Interfaces
The only hardware required is a laptop with a web browser through which Google Colab can be accessed.
2.4.3 OTHER NON FUNCTIONAL REQUIREMENTS
2.4.3.1 PERFORMANCE REQUIREMENTS
Usability: the simplicity with which any kind of stock trader or other stakeholder
in the stock market can understand the user interface of the stock prediction software.
Efficiency: maintaining the highest possible accuracy in the predicted closing stock prices in the
shortest time with the available data.
Performance: a quality attribute of the stock prediction software that describes its
responsiveness to various user interactions.
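The accuracy requirement on predicted closing prices can be quantified with standard error metrics. The sketch below computes root mean squared error (RMSE) and mean absolute percentage error (MAPE). This section of the report does not name the metric used, so the choice of these two is an assumption, and the sample prices are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the prices."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error; actual prices must be non-zero."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical closing prices for illustration only.
actual = [100.0, 102.0, 101.0]
predicted = [98.0, 103.0, 101.0]
error_rmse = rmse(actual, predicted)
error_mape = mape(actual, predicted)
```

RMSE penalizes large misses more heavily, while MAPE is scale-free and therefore easier to compare across stocks trading at different price levels.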
2.5 APPROVED OBJECTIVES
Successful prediction of future prices of stocks. In today's competitive market, predicting stock
returns and a company's financial health in advance provides more benefits for investors to invest
with confidence. Accurately predicting stock price movements allows investors to earn more.
2.6 PROJECT OUTCOMES AND DELIVERABLES
Through this project, we intend to build a model that will predict future stock market prices of
companies using the Long Short-Term Memory (LSTM) approach. Other factors which we will
consider for prediction are open price, close price, low, high, and volume of previous days.
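To use the open, high, low, close, and volume values of previous days as input, the price history is typically cut into fixed-length windows, each paired with the next day's closing price as the target. The following is a minimal sketch of that step. The helper name `make_windows`, the two-day lookback, and the sample values are illustrative assumptions.

```python
def make_windows(rows, lookback, target_index):
    """Build supervised samples from chronologically ordered daily rows.

    rows: list of per-day feature lists, oldest first.
    lookback: number of previous days in each input window.
    target_index: position of the value to predict (here, the close).
    Returns (samples, targets), where samples[k] holds the `lookback` days
    immediately before the day whose value is targets[k].
    """
    samples, targets = [], []
    for end in range(lookback, len(rows)):
        samples.append(rows[end - lookback:end])
        targets.append(rows[end][target_index])
    return samples, targets


# Hypothetical rows in the order [open, high, low, close, volume].
rows = [
    [100.0, 104.0, 99.0, 103.0, 12000],
    [103.0, 105.0, 101.0, 102.0, 9000],
    [102.0, 106.0, 101.0, 105.0, 15000],
    [105.0, 107.0, 103.0, 104.0, 11000],
]
X, y = make_windows(rows, lookback=2, target_index=3)
```

Each sample contains only days strictly before its target day, so no future information leaks into the input.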
2.7 RISK ANALYSIS
Risk analysis is a key step in identifying undesirable scenarios with insufficient levels of
preparedness. Based on our research and expertise in this area, we can anticipate and mitigate the
impact of the maximum number of outcomes. We hope that this does not affect the development of
the project, but it is natural to not be able to cover all aspects of the domain.
VOLATILITY:
Volatility is the standard deviation of a stock's annual returns over a period, indicating the extent to
which its price may rise or fall. A stock is said to be highly volatile when it changes rapidly, making
new highs and lows in a short period of time. Volatility is said to be low if the stock price moves
slowly up and down or is relatively stable. Our model may give less accurate predictions if there is
high volatility in the market.
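The definition above can be turned into a short calculation. The sketch below estimates volatility as the sample standard deviation of daily simple returns, scaled to a yearly figure. The scaling by the square root of 252 trading days is a common convention and is an assumption here, as are the function name and sample prices.

```python
import math
import statistics


def annualized_volatility(closes, periods_per_year=252):
    """Estimate volatility from a list of closing prices, oldest first.

    Needs at least three prices so that two returns are available.
    """
    returns = [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]
    return statistics.stdev(returns) * math.sqrt(periods_per_year)


# Hypothetical closing prices for illustration only.
calm = [100.0, 100.5, 100.2, 100.6, 100.4]
jumpy = [100.0, 108.0, 97.0, 109.0, 95.0]
calm_vol = annualized_volatility(calm)
jumpy_vol = annualized_volatility(jumpy)
```

The second series swings far more from day to day, so its estimate is higher. A model trained mostly on calm periods can be expected to perform worse on such data.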
ECONOMIC VARIABLES:
It is possible that other variables influence certain correlations, such as economic variables that may
affect specific stocks at the macro level. Current economic stability news can indirectly affect
overall markets and other variables such as volatility by influencing investors' risk appetite and
market- or asset-specific sensitivity to downturns.
INCONSISTENT DATA:
Inadequate data or a lack of tweets may also lead to inaccurate predictions of prices. Stock market
forecasting is a major challenge due to non-stationary, noisy, and chaotic data.
CHAPTER 3: FLOWCHART
3.1 WORKFLOW ARCHITECTURE
Fig. 1 Workflow Diagram
3.2 ACTIVITY DIAGRAM
Fig. 2 Activity Diagram
3.3 TOOLS AND TECHNOLOGIES USED
● Google Colab
● Python
● OpenCV
● VS Code
● NLP
● Machine learning
CHAPTER 4: PROJECT DESCRIPTION
4.1 LIBRARIES AND LANGUAGES USED
4.1.1. PYTHON
Python is an interpreted, high-level, general-purpose programming language. Created by Guido
van Rossum and first released in 1991, Python's design philosophy emphasizes code readability
with its notable use of significant whitespace. Its language constructs and object-oriented approach
aim to help programmers write clear, logical code for small and large-scale projects. Python is
dynamically typed and garbage collected. It supports multiple programming paradigms, including
structured (particularly procedural), object-oriented, and functional programming. Due to its
comprehensive standard library, Python is often described as a "batteries included" language.
Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released
in 2000, introduced features like list comprehensions and a cycle-detecting garbage collector
in addition to reference counting.
4.1.2 MACHINE LEARNING
Machine learning is a method of data analysis that automates analytical model building. It is a
branch of artificial intelligence based on the idea that systems can learn from data, identify
patterns, and make decisions with minimal human intervention. In this project, Matplotlib is used for
plotting graphs, and scikit-learn is one of the most widely used libraries for machine learning in
Python. The sklearn library contains many efficient tools for machine learning and statistical
modeling, including classification, regression, clustering, and dimensionality reduction.
4.1.3 NLP
Natural language processing (NLP) refers to the field of computer science, more specifically
artificial intelligence (AI), which deals with giving computers the ability to understand texts and
spoken language in the same way as humans. NLP combines computational linguistics (rule-based
modeling of human language) with models of statistics, machine learning, and deep learning.
Combining these technologies, computers can process human language in the form of text or audio
data and "understand" its full meaning, including the intent and sentiment of the speaker or writer.
NLP drives computer programs that translate text from one language to another, respond to voice
commands, and summarize large amounts of text quickly, even in real time. Most people have already
interacted with NLP in the form of voice-controlled GPS systems, digital assistants, speech-to-text
dictation software, customer service chatbots, and other consumer conveniences.
However, NLP also plays a growing role in enterprise solutions that help streamline business
operations, increase employee productivity, and simplify mission-critical business processes.
4.1.4 SENTIMENT ANALYSIS
Data analysts use sentiment analysis to extract information for market research and to monitor
brand and product reputation. The technique also helps a company learn what its customers think
and act on it to improve the customer experience. In addition, companies involved in
data analysis typically integrate third-party APIs for sentiment analysis into their infrastructure to
extract useful insights and make them available to their customers. Sentiment analysis can be
rule-based, relying on lists of words with known polarity, or it can use NLP and machine learning
techniques that learn polarity from labeled examples.
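To make the rule-based idea concrete, the following is a minimal sketch of a lexicon-based polarity scorer. The word lists, the negation rule, and the function name `polarity` are all hypothetical and chosen for illustration; this is not the analyzer used in the project, and real lexicons are far larger.

```python
# Toy rule-based sentiment scorer. The word lists are hypothetical and
# much smaller than any real sentiment lexicon.
POSITIVE = {"good", "great", "gain", "gains", "strong", "up", "beat"}
NEGATIVE = {"bad", "weak", "loss", "losses", "down", "miss", "drop"}
NEGATIONS = {"not", "no", "never"}


def polarity(text: str) -> float:
    """Return a score in [-1, 1]: the mean polarity of the matched words."""
    score = 0
    hits = 0
    negate = False
    for word in text.lower().split():
        word = word.strip(".,!?;:")
        if word in NEGATIONS:
            # Flip the polarity of the next word only.
            negate = True
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)  # +1, 0 or -1
        if value:
            score += -value if negate else value
            hits += 1
        negate = False
    return score / hits if hits else 0.0


if __name__ == "__main__":
    for sample in ("Strong earnings, great gains", "not good", "hello world"):
        print(sample, "->", polarity(sample))
```

A scorer like this is transparent and needs no training data, but it ignores context and sarcasm, which is where learned models tend to do better.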
4.1.5 LSTM
Long short-term memory networks, commonly known as LSTMs, are a special type of recurrent
neural network that can learn and predict long sequences. In contrast to regular feedforward neural
networks, LSTMs have feedback connections, so they can process an entire data sequence,
not just individual data points. By design, LSTMs retain information over long
periods of time. A further benefit of LSTMs when learning long sequences is that they can
learn to make one-shot multi-step predictions, which is very useful for time series forecasting.
An LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell retains
values over a period of time, and the gates control the flow of information into and out of the cell.
Fig. 3 LSTM Diagram
The repeating module in an LSTM contains four interacting layers.
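The compact forms of the equations for the forward pass of an LSTM unit are:
f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \sigma_c(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \sigma_h(c_t)
where the initial values are c_0 = 0 and h_0 = 0, and the operator \odot denotes the element-wise
product. The subscript t indexes the time step.
Here the variables are -
x_t : input vector to the LSTM unit
f_t : forget gate's activation vector
i_t : input gate's activation vector
o_t : output gate's activation vector
h_t : hidden state (output) vector of the LSTM unit
\tilde{c}_t : candidate cell state vector
c_t : cell state vector
W, U, b : weight matrices and bias vectors learned during training
Activation functions:
\sigma_g : sigmoid function
\sigma_c : hyperbolic tangent function
\sigma_h : hyperbolic tangent function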
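As an illustration, a single forward step of a standard LSTM unit (sigmoid forget, input, and output gates; hyperbolic tangent for the candidate cell state and for the output) can be sketched in plain Python. This is a teaching sketch, not the project's implementation: the helper names (`matvec`, `gate`, `lstm_step`), the layout of `params`, and the weights in the demo are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gate(W, U, b, x, h_prev, activation):
    """Compute activation(W x + U h_prev + b) element by element."""
    wx = matvec(W, x)
    uh = matvec(U, h_prev)
    return [activation(a + c + d) for a, c, d in zip(wx, uh, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One time step. params maps 'f', 'i', 'o', 'c' to (W, U, b) tuples."""
    f = gate(*params["f"], x, h_prev, sigmoid)          # forget gate
    i = gate(*params["i"], x, h_prev, sigmoid)          # input gate
    o = gate(*params["o"], x, h_prev, sigmoid)          # output gate
    c_tilde = gate(*params["c"], x, h_prev, math.tanh)  # candidate cell state
    # New cell state: keep part of the old state, add part of the candidate.
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    # New hidden state: gated, squashed cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


if __name__ == "__main__":
    # Hypothetical weights: input size 1, hidden size 2.
    W = [[0.5], [-0.3]]
    U = [[0.1, 0.0], [0.0, 0.1]]
    b = [0.0, 0.0]
    params = {name: (W, U, b) for name in "fioc"}
    h, c = [0.0, 0.0], [0.0, 0.0]  # zero initial hidden and cell state
    for x_t in ([1.0], [0.5], [-1.0]):
        h, c = lstm_step(x_t, h, c, params)
    print(h, c)
```

In practice a deep learning framework would provide this layer; writing the step out by hand only serves to show how the gates combine the previous state with the new input.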
4.2 TOOLS
Sentiment analysis presents many challenges. The main problems are that a model built for one
domain does not work well in others; that the accuracy and performance of sentiment analysis based
on poorly labeled data is inadequate; and that it cannot handle complex sentences whose meaning
depends on more than emotional words. Our approach would require a large amount of labeled news
data to train a model that accurately predicts news sentiment, and such data is not easy to
obtain. For this reason, we use pre-trained models. BERT can be fine-tuned to work well for specific
use cases (such as news sentiment analysis), but its performance depends on the quality of the
training data. TextBlob is a library that ships with pre-trained sentiment analyzers. It provides a
consistent API for common natural language processing (NLP) tasks such as part-of-speech tagging,
noun phrase extraction, and sentiment analysis. After testing both libraries, we found that TextBlob
gave better results than BERT. However, if better labeled data is available, we recommend using BERT.
BERT:
BERT stands for Bidirectional Encoder Representations from Transformers. It is a
Transformer-based machine learning technique developed by Google for pre-training in Natural
Language Processing (NLP). It is designed to pre-train deep bidirectional representations from
unlabeled text by jointly conditioning on both left and right context. As a result, a
pre-trained BERT model can be fine-tuned with just one additional output layer to create
state-of-the-art models for a variety of NLP tasks.
Fig. 4 BERT Diagram
TextBlob:
TextBlob is a Python library for natural language processing (NLP). TextBlob uses the
Natural Language Toolkit (NLTK) to perform its tasks. NLTK is a library that provides easy access
to many lexical resources and allows users to work with classification and many other tasks.
TextBlob is a lightweight library that supports complex analysis and manipulation of text data. The
TextBlob sentiments module contains two sentiment analysis implementations: PatternAnalyzer
(based on the pattern library) and NaiveBayesAnalyzer (an NLTK classifier trained on a movie
reviews corpus).
4.3 PROCEDURE
Step 1 :
Data Collection
Tweets about Microsoft, Google, and Apple are extracted using the Twitter API.
For Microsoft, for example, the tweets are filtered by keywords such as $MSFT, #Microsoft, and
#Windows, so that they capture not only public opinion about the company's stock but also public
opinion about the products and services that it provides. The filter terms were chosen carefully
so that the extracted tweets reflect the sentiment of the general public towards
Microsoft over a specific time period. Twitter news about Microsoft and
tweets about product releases can also be included. The opening and closing prices for Microsoft
stock are obtained from Yahoo! Finance.
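The keyword filtering in this step can be sketched as below. The sketch assumes the tweets have already been downloaded and are held as dictionaries with a "text" field. That layout, the sample tweets, and the function names are hypothetical, and the call to the Twitter API itself is not shown.

```python
# Keywords from the report, lower-cased so matching is case-insensitive.
KEYWORDS = ("$msft", "#microsoft", "#windows")


def matches_keywords(text, keywords=KEYWORDS):
    """Return True if the tweet text contains at least one keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_tweets(tweets, keywords=KEYWORDS):
    """Keep only the tweets whose text mentions one of the keywords."""
    return [t for t in tweets if matches_keywords(t["text"], keywords)]


if __name__ == "__main__":
    # Made-up sample tweets, for illustration only.
    sample = [
        {"date": "2022-08-01", "text": "Loving the new #Windows update"},
        {"date": "2022-08-01", "text": "Great weather today"},
        {"date": "2022-08-02", "text": "$MSFT looks strong this quarter"},
    ]
    for tweet in filter_tweets(sample):
        print(tweet["date"], tweet["text"])
```

Plain substring matching is used here for brevity; it would also match longer hashtags such as #WindowsPhone, which may or may not be desirable.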
Step 2 :
Data Pre-Processing
The collected stock price data is not complete, understandably, because the stock market is closed
on weekends and public holidays. The missing data is approximated using a
simple technique, on the assumption that stock data usually follows a concave function. If the
stock value on a given day is x and the next available value is y, with some values missing in
between, the first missing value is approximated as (y+x)/2, and the same method is repeated to
fill all the gaps. Tweets contain many acronyms, emoticons, and unnecessary data like pictures and
URLs. So, tweets are pre-processed to represent the correct emotions of the public. For
pre-processing tweets, we employed three stages of filtering: tokenization, stop word removal, and
regex matching for removing special characters.
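The gap-filling rule for missing trading days can be sketched as follows. The sketch reads the rule as: each missing value becomes the average of the value just before it (which may itself have been filled) and the next available value. The function name `fill_gaps` and the use of `None` for a missing day are assumptions.

```python
from typing import List, Optional


def fill_gaps(prices: List[Optional[float]]) -> List[Optional[float]]:
    """Fill missing prices (None) with (x + y) / 2, working left to right.

    x is the previous value (already filled, if it was missing) and y is
    the next value that is actually present. Gaps at the very start or
    very end have no neighbour on one side and are left as None.
    """
    filled = list(prices)
    for index, value in enumerate(filled):
        if value is not None:
            continue
        previous = filled[index - 1] if index > 0 else None
        following = next((p for p in filled[index + 1:] if p is not None), None)
        if previous is not None and following is not None:
            filled[index] = (previous + following) / 2
    return filled


if __name__ == "__main__":
    # Friday close, a two-day weekend gap, Monday close (made-up values).
    print(fill_gaps([100.0, None, None, 108.0]))
```

Because each filled value feeds into the next one, the estimates move progressively towards the next known price instead of forming a straight line between the two known points.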
● Tokenization: Tweets are split into individual words based on whitespace, and
irrelevant symbols like emoticons are removed. We form a list of individual words for each
tweet.
● Stop word removal: Words that do not express any emotion are called "stop words." After
splitting a tweet, words like a, is, the, with, etc. are removed from the list of words.
● Regex matching for special character removal: Regex matching in Python is used to
match URLs, which are replaced by the term URL.
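The three filtering stages above can be sketched with the Python standard library alone. The stop word list here is a tiny hypothetical sample, and the choice to keep the `$` and `#` characters (so that tickers and hashtags survive) is an assumption and not something the report specifies.

```python
import re

# Small illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "is", "the", "with", "of", "to", "and"}

URL_PATTERN = re.compile(r"https?://\S+")
# Anything that is not a letter, digit, '$' or '#' is treated as noise.
NOISE_PATTERN = re.compile(r"[^A-Za-z0-9$#]+")


def preprocess(tweet):
    """Return the cleaned list of tokens for one tweet."""
    # Regex matching: replace every URL with the placeholder term URL.
    text = URL_PATTERN.sub("URL", tweet)
    tokens = []
    # Tokenization: split on whitespace.
    for token in text.split():
        # Remove emoticons, punctuation and other special characters.
        token = NOISE_PATTERN.sub("", token)
        if not token:
            continue
        # Stop word removal.
        if token.lower() in STOP_WORDS:
            continue
        tokens.append(token if token == "URL" else token.lower())
    return tokens


if __name__ == "__main__":
    print(preprocess("Microsoft is launching a new product! 🚀 https://t.co/abc"))
```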
Step 3 :
Sentiment Analysis
Sentiment analysis tasks are highly domain-specific. Tweets are classified as positive, negative, or
neutral based on the sentiment present. A subset of the total tweets is examined by humans and
annotated as 1 for positive, 0 for neutral, and 2 for negative emotions. To classify the tweets
that were not annotated by humans, a machine learning model is trained on features extracted from
the human-annotated tweets.
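Where an automatic polarity score is available for a tweet (for example from a pre-trained analyzer), it can be mapped onto the same label encoding. The sketch below assumes a score in the range -1 to 1; the width of the neutral band and the function names are hypothetical choices, not values taken from the project.

```python
from collections import Counter

# Label encoding used in the report.
POSITIVE, NEUTRAL, NEGATIVE = 1, 0, 2


def encode_label(polarity: float, neutral_band: float = 0.05) -> int:
    """Map a polarity score in [-1, 1] to 1 (positive), 0 (neutral) or 2 (negative)."""
    if polarity > neutral_band:
        return POSITIVE
    if polarity < -neutral_band:
        return NEGATIVE
    return NEUTRAL


def label_distribution(polarities):
    """Count how many scores fall into each class."""
    return Counter(encode_label(p) for p in polarities)


if __name__ == "__main__":
    # Made-up polarity scores for five tweets.
    print(label_distribution([0.8, -0.4, 0.0, 0.02, 0.3]))
```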
Step 4 :
Feature Extraction
Textual representation can be done using n-grams. N-gram representation
is known for its specificity to the corpus of text being studied. In this technique, a full
corpus of related text (the tweets, in the present work) is parsed, and every word
sequence of length n that appears is extracted to form a dictionary of words and phrases. For
example, the text “Microsoft is launching a new product” has the following 3-gram word features:
“Microsoft is launching,” “is launching a,” “launching a new,” and “a new product.” In our case,
N-grams are extracted from all the tweets in the corpus. Each tweet is then split into N-grams,
and the features of the model are a string of 1s and 0s, where a 1 indicates that the
corresponding N-gram from the corpus is present in the tweet and a 0 indicates its absence.
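The n-gram features can be sketched as follows, using the 3-gram example from the text. The function names and the second sample tweet are hypothetical, and the dictionary is simply the sorted set of all n-grams seen in the corpus.

```python
def ngrams(text, n=3):
    """Return every word sequence of length n in the text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_dictionary(corpus, n=3):
    """Collect the distinct n-grams of all tweets into a sorted list."""
    return sorted({gram for tweet in corpus for gram in ngrams(tweet, n)})


def to_features(tweet, dictionary, n=3):
    """Binary vector: 1 if the dictionary n-gram occurs in the tweet, else 0."""
    present = set(ngrams(tweet, n))
    return [1 if gram in present else 0 for gram in dictionary]


if __name__ == "__main__":
    corpus = [
        "Microsoft is launching a new product",
        "Apple is launching a new phone",  # made-up second tweet
    ]
    dictionary = build_dictionary(corpus)
    for tweet in corpus:
        print(to_features(tweet, dictionary), tweet)
```

With a realistic corpus the dictionary grows quickly, so the resulting vectors are long and mostly zero; sparse representations are normally used for that reason.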
Step 5 :
Model Training
The features extracted from the tweets using the above methods are fed to classifiers trained
with methods like Logistic Regression, Decision Tree, SVM, and KNN to estimate
the direction of the change in stock market price from the volume and sentiment of news
articles and tweets. Linear Regression is then applied to find the relation between the change in
stock market price and the volume as well as sentiment of news articles and tweets.
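The linear regression part of this step can be illustrated with a one-variable ordinary least squares fit of daily price change against daily mean sentiment. This is only a sketch under stated assumptions: the data below is made up, the function name `fit_line` is hypothetical, and the project itself may have used a library implementation with more input variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    return slope * x + intercept


if __name__ == "__main__":
    # Made-up values: mean daily tweet sentiment and next-day price change (%).
    sentiment = [-0.6, -0.2, 0.0, 0.3, 0.7]
    price_change = [-1.1, -0.5, 0.1, 0.4, 1.3]
    slope, intercept = fit_line(sentiment, price_change)
    print("slope:", slope, "intercept:", intercept)
    print("predicted change for sentiment 0.5:", predict(0.5, slope, intercept))
```

A positive slope would indicate that days with more positive sentiment tend to be followed by larger price increases in the sample; it does not by itself establish a causal link.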
CHAPTER 5 - IMPLEMENTATION AND EXPERIMENTAL RESULTS
5.1 SAMPLE CODE
5.2 OUTPUT AND ACCURACY
Epoch 1/4
237/237 [==============================] - 258s 1s/step - loss: 0.7036 - accuracy: 0.5013 - val_loss: 0.6978 - val_accuracy: 0.4500
Epoch 2/4
237/237 [==============================] - 256s 1s/step - loss: 0.6975 - accuracy: 0.5167 - val_loss: 0.6992 - val_accuracy: 0.5700
Epoch 3/4
237/237 [==============================] - 255s 1s/step - loss: 0.6902 - accuracy: 0.5384 - val_loss: 0.6859 - val_accuracy: 0.6000
Epoch 4/4
237/237 [==============================] - 256s 1s/step - loss: 0.6392 - accuracy: 0.6490 - val_loss: 0.7409 - val_accuracy: 0.5000
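We are able to attain about 65% training accuracy (0.6490) in the final epoch. Validation accuracy
peaked at 60% in epoch 3 and dropped to 50% in epoch 4, so the training figure likely overstates
performance on unseen data.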
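The per-epoch figures from the training log above can be compared programmatically. The sketch below copies the values from the log into a dictionary laid out like the one Keras exposes as `History.history`; the helper names are hypothetical.

```python
# Metric values copied from the four epochs of the training log.
history = {
    "accuracy": [0.5013, 0.5167, 0.5384, 0.6490],
    "val_accuracy": [0.4500, 0.5700, 0.6000, 0.5000],
    "val_loss": [0.6978, 0.6992, 0.6859, 0.7409],
}


def best_epoch(values, higher_is_better=True):
    """Return the 1-based epoch number with the best metric value."""
    pick = max if higher_is_better else min
    index = pick(range(len(values)), key=lambda i: values[i])
    return index + 1


def final_gap(train_values, val_values):
    """Difference between training and validation metric in the last epoch."""
    return train_values[-1] - val_values[-1]


if __name__ == "__main__":
    print("best epoch by val_accuracy:", best_epoch(history["val_accuracy"]))
    print("best epoch by val_loss:", best_epoch(history["val_loss"], higher_is_better=False))
    print("final train/val accuracy gap:", final_gap(history["accuracy"], history["val_accuracy"]))
```

A widening gap between training and validation accuracy in the last epoch is a common sign of overfitting, which is one reason to stop training at the epoch with the best validation score.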
CHAPTER 6 - Outcome and Prospective learning
6.1 OUTCOMES
Our initiative focuses mostly on increasing productivity and resource utilization. Through this
project, we hope to create a Long Short-Term Memory (LSTM) approach model that can forecast
future stock market prices of corporations. The open price, close price, low, high, and volume from
prior days are other features that we take into account when making a prediction.
6.2 FUTURE SCOPES
● Enhance reliability and the user experience by improving the GUI.
● Our future focus would include the addition of other variables that influence stock market
forecasting. Including more relevant parameters may improve the estimates.
6.3 PROSPECTIVE LEARNING
The learning outcomes for the Capstone project are as follows:
● Developing new/multidisciplinary technical skills
● Using professional and technical terminology appropriately
● Effectively utilizing and troubleshooting a tool for the development of a technical solution
● Analyzing data to create information
● Creating a technical report with the usage of international standards
● Acquiring and evaluating information
6.4 CONCLUSION
The dataset we used to construct machine learning algorithms for stock market price prediction
worked out effectively. On the dataset, we used feature selection and data pre-processing. Our
machine learning model makes use of the LSTM method. Furthermore, we have identified and
extracted subjective material from user views, judgments, sentiments, attitudes, and emotions
using text mining and sentiment analysis for natural language processing (NLP).
CHAPTER 7 – PROJECT TIMELINE
7.1 GANTT CHART
Fig. 5 Gantt Chart
7.2 PROJECT TIMELINE
Month Work done/ Expected to be done
Feb Project planning and discussion with mentors
March Study from research papers, study of software requirements
April Finalizing design flow and studying LSTM and NLP algorithms
May Research on various social media APIs for sentiment analysis.
June Sentiment analysis of Twitter data
July Applying different NLP algorithms to get the most accurate results
August Applying BERT and TextBlob to predictions on a single stock
September Testing the accuracy with different stocks for different time durations
October Analyzing the results provided by our model
November Documentation and finalization of the project

stock price prediction using sentiment analysis

  • 1. STOCK PRICE PREDICTION USING SENTIMENT ANALYSIS A capstone project report submitted in partial fulfillment of the requirement for the award of the degree of Bachelor of Engineering Electronics and Communication Department Submitted By ● Team Member 1 Nitish Garg 101915020 ngarg_be19@thapar.edu ● Team Member 2 Mayukh Sharma 101915076 msharma_be19@thapar.edu ● Team Member 3 Shruti Singh 101956003 ssingh11_be19@thapar.edu ● Team Member 4 Riya Bajaj 101915092 rbajaj_be19@thapar.edu ● Team Member 5 Himank Jindal 101915023 hjindal_be19@thapar.edu Under the supervision of Dr. Rajesh Khanna (Professor, ECED) Dr. Surbhi Sharma (Associate Professor, ECED) Department of Electronics and Communication Engineering THAPAR INSTITUTE OF ENGINEERING & TECHNOLOGY, PATIALA, PUNJAB December 2022
  • 2. DECLARATION We hereby declare that the capstone project group report titled “Stock price Prediction using Sentiment Analysis” is an authentic record of our own work carried out at “Thapar Institute of Engineering and Technology, Patiala” as a capstone project in the seventh semester of B.E. (Electronics & Communication Engineering), under the guidance of “Dr. Rajesh Khanna” and “Dr. Surbhi Sharma” from January to December 2022. Date: Aug 15, 2022 Name Registration Number Signature Himank Jindal 101915023 Nitish Garg 101915020 Riya Bajaj 101915092 Mayukh Sharma 101915076 Shruti Singh 101956003 2
  • 3. ACKNOWLEDGEMENT We would like to express our thanks to our mentors Dr. Rajesh Khanna and Dr. Surbhi Sharma. They have been of great help in our project and are an indispensable resource of technical knowledge. They truly are amazing mentors to have. We are also thankful to Dr. Alpana Agarwal, Head, Electronics and Communication Engineering Department, the entire faculty and staff of the Electronics and Communication Engineering Department, and also our friends, who devoted their valuable time and helped us in all possible ways towards the successful completion of this project. We thank all those who have contributed either directly or indirectly towards this project. Lastly, we would also like to thank our families for their unyielding love and encouragement. They always wanted the best for us, and we admire their determination and sacrifice. Date: Aug 15, 2022 Name Registration Number Himank Jindal 101915023 Nitish Garg 101915020 Mayukh Sharma 101915076 Riya Bajaj 101915092 Shruti Singh 101956003 3
  • 4. ABSTRACT Prediction and analysis of the stock market are some of the most difficult tasks to execute. There are numerous reasons for this, including market volatility and a variety of other dependent and independent variables that influence the market value of a particular stock. Because of these factors, it is extremely difficult for any stock market expert to predict the market's rise and fall with great accuracy, and thus it becomes challenging for the investor to invest the money to make profits. We will develop a machine learning model to forecast stock prices using the Long Short-Term Memory (LSTM) approach. By multiplying and adding, they are utilized to make small modifications to the data. Long-term memory (LSTM) is a deep learning artificial recurrent neural network (RNN) architecture. LSTM has feedback connections, unlike standard feed-forward neural networks. It can handle both single data points (like photos) and entire data sequences (such as speech or video). Text mining and natural language processing (NLP) are used in sentiment analysis, also known as opinion mining, to identify and extract subjective material from users' opinions, assessments, feelings, attitudes, and emotions. The stock price is more likely to rise if the news sentiment is positive; the stock price is more likely to decrease if the news sentiment is negative. The goal of this research is to create a model that can predict news polarity, which can affect stock market trends. 4
  • 5. TABLE OF CONTENTS Declaration 2 Acknowledgment 3 Abstract 4 List of figures 6 Chapter 1: Introduction 7 1.1 Project overview 7 1.2 Motivation 7 1.3 Assumptions and constraints 7 1.4 Novelty of work 7 Chapter 2: Literature survey 8 2.1 Literature survey 8 2.2 Project timeline 9 2.3 Problem definition and scope 12 2.4 Risk analysis 12 2.5 Approved objective 13 2.6 Project outcomes and Deliverables 15 2.7 Risk Analysis 15 Chapter 3: Flowchart 15 3.1 Workflow architecture 16 3.2 Activity diagram 16 3.3 Tools and technologies used 17 Chapter 4: Project description 17 4.1 Libraries and languages used 18 4.2 Tools 18 4.3 Procedure 20 Chapter 5: Implementation and Experimental Results 21 5.1 Sample Code 21 5.2 Output and Accuracy 37 Chapter 6: Outcomes and Prospective learning 38 6.1 Outcomes 38 5
  • 6. 6.2 Future Scopes 38 6.3 Prospective learning 38 6.4 Conclusion 38 Chapter 7: Project timeline 39 7.1 Gantt chart 39 7.2 Project timeline 39 References 40 LIST OF FIGURES Figure No. Figure Content Page no. Figure 1 Architectural Growth Sequence 16 Figure 2 Activity diagram 17 Figure 3 LSTM diagram 18 Figure 4 BERT diagram 20 Figure 5 Gantt Chart 39 6
  • 7. CHAPTER 1: INTRODUCTION 1.1 PROJECT OVERVIEW For a long time, stock market forecasting has been a hot topic of study. Stock market prices, according to the Efficient Market Hypothesis (EMH), are mostly driven by fresh information and follow a random walk pattern. Several people have attempted to extract patterns in the way stock markets operate and respond to external stimuli, even though this theory is largely acknowledged by the scientific community as a basic paradigm regulating markets in general. The stock market is a place where people can make a fortune if they can successfully predict future market movements. Because the stock market is volatile and exhibits complicated behavior, making decisions is both difficult and necessary. Investors are always looking for a better technique to forecast future stock price behavior, which will help them determine the optimal moment to trade stocks in order to maximize their returns. Studies have shown that the future trends of a company in stock markets are heavily influenced by its past performance, and studies have shown that the company's image also plays an important role in prediction. For example, negative news about the company can have a significant impact on the market trend, leading to a downward movement. Before investing, investors consider the company's past performance as well as the influence of recent news. 1.2 MOTIVATION So, the major goal here is to develop a model for estimating stock market futures trends with a low error ratio and improve prediction accuracy. Data mining can be used to extract information from huge and complicated datasets, resulting in superior stock market trend predictions. To forecast future stock market behavior, we will combine financial news sentiment analysis with attributes taken from historical stock prices. We'll implement both the sentiment analysis and historical data models separately, then combine their results to create an effective market prediction model. 
1.3 ASSUMPTIONS AND CONSTRAINTS
Since the market today is dynamic, our analysis is only a reference for understanding how the market works under a limited set of constraints. The basic assumption is that historical data and sentiment analysis of news are sufficient to predict market trends. Enough tweets must be available for the sentiment analysis to be accurate. The previous month's stock data (open, high, low, close, and volume) must be available to train the machine learning model.
1.4 NOVELTY OF WORK
Most existing stock price prediction models in use today base their predictions on historical data alone. This approach has a fundamental flaw: it does not take real-world events into account. Stock prices do not depend only on previous prices, just as the future is not entirely determined by past events. News can also affect stock trends, and sudden increases or decreases in stock prices are often associated with real-world events. A more realistic view is therefore that prices also depend on the emotions of investors and on how the media portrays the company. Our model stands apart because it considers not only historical data but also current news, using sentiment analysis of Twitter trends.
● Sentiment analysis is the automated process of analyzing text to determine the sentiment it communicates (positive, negative, or neutral).
● Using sentiment analysis, we can take thousands of tweets about a company and evaluate in real time whether they are positive or negative.
● Sentiment analysis is frequently used to predict direct changes in stock prices that result from immediate changes in sentiment.
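The three-way labeling described in the bullets above can be illustrated with a minimal sketch. This is not the analyzer used in this project: it is a toy lexicon-based classifier, and the word lists, the function name `classify_sentiment`, and the sample tweets are all hypothetical and chosen only for illustration.

```python
import re

# Tiny illustrative word lists (assumption); a real system would use a
# trained model or a full sentiment lexicon.
POSITIVE = {"good", "great", "gain", "gains", "profit", "strong", "beat", "beats"}
NEGATIVE = {"bad", "loss", "losses", "weak", "miss", "misses", "lawsuit", "recall"}

def classify_sentiment(text: str) -> str:
    """Label text as 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Made-up example tweets, one per class.
tweets = [
    "Microsoft beats earnings estimates, strong cloud profit",
    "Product recall and lawsuit put the company under pressure",
    "The company will report results on Tuesday",
]
labels = [classify_sentiment(t) for t in tweets]
print(labels)
```

A word-counting rule like this ignores negation and context, which is one reason trained analyzers are usually preferred in practice.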
CHAPTER 2: LITERATURE SURVEY
2.1 LITERATURE SURVEY
In a recent study by Singh, P. K. et al. [7], sentiment analysis was applied to reviews on the Flipkart e-commerce website to filter out irrelevant reviews, with MongoDB used as the backend database. Gunduz et al. [9] examined whether the sentiment expressed about Turkish universities on social networks is related to their academic success, using a Naive Bayes classifier. The 10 most successful Turkish universities, as ranked by URAP, were selected, and Twitter, which allows users to share tweets with friends and followers, was chosen as the social media platform. Tweets were first collected via the Twitter REST API and then labeled as positive, negative, or neutral. In pre-processing for feature extraction, meaningful special characters were extracted from the tweets. The tweets were then converted into word lists using two approaches: term frequency and inverse document frequency. The reported success rate of the system was 72.33%. Molla et al. [10] performed sentiment analysis of user opinions about different Samsung products using several official Samsung Twitter accounts. To visualize the results, data visualization tools such as NodeXL were used to draw the social network graph. The proposed future work focused on handling the location of each tweet and including emotions. Lu, Y., and Chen, J. (2012) presented a study on opinion analysis of microblog content.
The public opinion model was divided into four modules: a data collection module, a corpus processing module, a sentiment analysis module, and a data management module. Crawlers were used to retrieve online microblog content, and a support vector machine (SVM) text classifier was used to classify the microblogs. The results showed that classification precision exceeded 90% with the SVM classifier. It was proposed that more work could be done to improve the performance of the SVM. Batool, R. et al. (2013) analyzed 4000 tweets to classify data and sentiment from Twitter more precisely, covering topics such as food, diet, diabetes, education, and movies. First, a knowledge generator was used to classify tweets into different categories, and then a knowledge enhancer with a synonym binder was applied to increase the information gain. The knowledge enhancer module adds knowledge that was not extracted by the Alchemy API used in the knowledge generator phase. The synonym binder binds synonyms to the entities and keywords extracted by the knowledge generator and knowledge enhancer. Results showed that an overall improvement of 0.1% to 55% was achieved using this approach. M. Meral & B. Diri [25] performed sentiment analysis of Turkish tweets on nine domains: insurance, sport, finance, food, automotive, politics, real estate, telecommunication, and health. The tweets were classified as neutral, positive, or negative using Naive Bayes, Support Vector Machines, and Random Forest. Health, politics, finance, and telecommunication tweets fell in the negative sentiment category; food, real estate, sports, and automotive tweets fell in the neutral category; and the rest were positive. From the results obtained, it was concluded that support vector machines give the best results compared to the other classifiers. Li, SWang et al. [27] applied sentiment analysis to Twitter data to predict the success of movies. For this purpose, movies were classified as flop, hit, or average. Tweets from 2009 to 2013 were extracted, and each tweet was classified as positive, negative, neutral, or irrelevant. A Lingpipe sentiment analyzer was used to test the sentiments, and results showed that the movie prediction accuracy of the developed system was 64.4% better than that of the conventional system. Another study, by Wang, X., & Luo [28], predicted movie performance from social networking site data using sentiment analysis. Sentiments were collected from various social media platforms, such as Twitter and YouTube, and the prediction was done using the K-means clustering algorithm.
2.1.1 THEORY ASSOCIATED WITH THE PROBLEM AREA
Businesses run primarily on customer satisfaction and customer reviews of their products. Shifts in sentiment on social media have been shown to correlate with shifts in stock markets. Identifying customer grievances and resolving them leads to customer satisfaction and builds trust in an organization. Hence, there is a need for an unbiased automated system to classify customer reviews regarding any problem. In today's environment of data overload (which does not by itself mean better or deeper insights), companies may have mountains of customer feedback, but it is still impossible for humans to analyze it manually without error or bias. Companies with the best intentions often find themselves in an insight vacuum: they know they need insights to inform their decision making and know that they lack them, but do not know how best to get them. Sentiment analysis provides some insight into what the most important issues are, at least from the perspective of customers.
Because sentiment analysis can be automated, decisions can be made based on a significant amount of data rather than plain intuition. Time series forecasting and modeling play an important role in data analysis. Time series analysis is a specialized branch of statistics used extensively in fields such as econometrics and operations research, and it is widely used in analytics and data science. Stock prices are volatile, and they depend on various factors. The main aim of this project is to predict stock prices using Long Short-Term Memory (LSTM) networks.
2.1.2 PROBLEMS FACED AND IDEA BEHIND THE APPROACH
The more we collectively understand how Amazon operates, the better we can all make informed decisions on where to purchase products online, in the same way that the more you know about the meat industry, the less likely you are to eat factory-processed meat products. By studying the inner details, we have been able to see the backend of Amazon's operation, and what we have learned has been both illuminating and disturbing. What most people do not know about Amazon is that 68% of the products currently sold on Amazon are sold by third-party sellers and are not manufactured by Amazon. Amazon manufactures only roughly 30% of the products that it sells on its site. This practice is called "private labeling." Currently, Amazon owns over 90 different private label brands (i.e., it manufactures a product and then puts its own "unique" brand name on it, such as Amazon Basics for tech products or Beauty Bar for cosmetics). So how does Amazon figure out which products to manufacture and private label? How does Amazon know what the consumer wants and which items will be profitable? The answers are simple and, unfortunately, criminal. For starters, Amazon has access to all of its third-party seller data. Sellers use Amazon's Seller Portal to list and sell their products. Amazon collects its 30% fee from them, they pay for and ship the product to the customer, and everyone is happy. But Amazon sees everything in its portal. It knows how many products each seller sells every month. It can even calculate margins if the seller enters enough information into the "calculator" that Amazon provides free of charge within its seller portal. So, when Amazon "sees" a product that is selling a certain number of units per month (an automatic algorithm calculates this), the product is "flagged" as one to be copied and manufactured by Amazon under a "new" brand name. With its powerful algorithms, Amazon can then ensure that the original product ends up buried at the bottom of its search results pages while its new knock-off shows up at the top of the page. Just like that, Amazon destroys the small business that it has taken fees from over the past few years. For years, Amazon has been taking ideas from sellers on its platform by using data that it claimed was off-limits. If large e-commerce players such as Amazon and Flipkart continue to damage the existing ecosystem of Indian retail through unethical and predatory business practices, then the future seems extremely bleak for India's small retailers. Unfortunately, these two large companies, which have almost 80 percent of India's e-commerce market, have given our country a maligned and vitiated foundation for e-commerce business. As marketplace entities, their prime responsibility was to create a healthy and thriving technological platform to promote the businesses of small sellers by connecting them to potential buyers.
But in stark contrast, their ulterior business motive has been to ensure that small offline retailers perish and shut down so that they can get a strong foothold in India's retail market. It is painful to note that in the last 12 months more than 50,000 mobile retailers, 30,000 electronics retailers, about 25,000 Kirana stores, and 35,000 garment retailers have shut their businesses, mainly due to these e-commerce giants, who have blatantly violated the Government's FDI policy and indulged in inventory control, predatory pricing, preferential seller treatment, and illegal exclusivity, among other violations.
2.1.3 THE PROBLEM IDENTIFIED
Among the existing methods, we found that some approaches predict stock prices based only on historical stock data and seldom introduce unstructured text data into the financial field. Although some methods consider the role of non-traditional data, they investigate only financial news or only social media information. To overcome these limitations, our goal is to predict the prices of five stocks in India's share market using multiple data sources and to calculate the error of the predicted prices. We first combine historical stock data, technical indicators, stock forum posts, and financial news. Then, we investigate text sentiment analysis based on convolutional neural networks (CNN) to calculate investors' sentiment tendency. Finally, we explore the advantages of long short-term memory (LSTM) networks for processing time series data to predict stock prices. The experimental results show that the proposed method can fit multi-source data well and achieve low error. Our contributions include three aspects:
● An LSTM framework is designed that incorporates multiple data sources and investors' sentiment.
● A sentiment analysis method based on CNN is proposed to calculate the investor sentiment index.
● An LSTM network with an attention mechanism is proposed to predict stock prices.
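As a rough illustration of how a price series can be prepared as time series input for an LSTM, the following minimal sketch turns closing prices into fixed-length input windows, each paired with the next value as the prediction target. It is not the project's code. The function names `min_max_scale` and `make_windows`, the window length of 3, the use of min-max scaling, and the sample prices are all assumptions made for illustration.

```python
def min_max_scale(series):
    """Scale values to [0, 1]; a common (assumed) normalization before LSTM training."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    return [(v - lo) / span for v in series]

def make_windows(series, window):
    """Return (inputs, targets): each input holds `window` consecutive values
    and its target is the value that immediately follows."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets

# Made-up closing prices for six trading days.
closes = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0]
scaled = min_max_scale(closes)
X, y = make_windows(scaled, window=3)

for features, target in zip(X, y):
    print(features, "->", target)
```

Each row of `X` would become one input sequence for the network and the matching entry of `y` the value it learns to predict; additional columns such as a daily sentiment score could be appended to every time step in the same way.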
2.2 RESEARCH GAPS
Using sentiment analysis along with previous data: Sentiment analysis is a particularly interesting area of natural language processing (NLP) used to assess the language in a body of text. Through sentiment analysis, one can take thousands of tweets about a company and evaluate in real time whether they are generally positive or negative. Many researchers have found investor sentiment to be an important factor in financial markets. In some cases, investors tend to buy stocks after good news is announced, which leads to higher stock prices, and sell after bad news breaks, which causes the price to fall. Information on the Internet is a valuable resource for gauging investor sentiment, and many researchers now use sentiment analysis and news analysis to predict stock prices.
Using an LSTM-based model: LSTM neural networks are a derivative of recurrent neural networks (RNNs). They not only address the long-term memory deficit of RNNs but also mitigate the problem of vanishing gradients. LSTM networks can dynamically learn and decide whether to make a given output the next recursive input. Because this mechanism can store important information, LSTMs provide a good basis for building the predictive models in this study.
2.3 PROBLEM DEFINITION AND SCOPE
2.3.1 PROBLEM STATEMENT
Stock market prediction relies on factors such as interest rates, economic activity, and related markets that influence the demand and supply of trading volume. Currently, stockbrokers who execute trades and advise clients rely on their experience, technical analysis (price trends), or fundamental analysis in picking stocks. These methods are subjective and usually short-sighted because of their limited capacity to crunch raw numbers. Given the amount of money involved in trading, improper investment could easily mean great losses for investors, especially if they keep making wrong decisions.
The lack of guaranteed returns has also made potential investors reluctant to participate in the market. It is therefore desirable to have a model that can predict the most likely next-day prices as a basis for making investment decisions. This study proposes text mining of financial news and of public sentiments and opinions from social media such as Twitter. The results of the text mining are converted into numeric form to feed the prediction process. Combining market data with news features helps improve the accuracy of predictions, yet existing systems have failed to integrate news features with market data effectively.
2.3.2 SCOPE AND LIMITATIONS
The project is limited to shares of companies listed on the NSE. Additionally, a company should have traded for at least five years to ensure data consistency. The languages used in the sentiment analysis process are English and Hindi. Slang and other vernacular languages are not considered. The study assumes that there is no manipulation by stockbrokers or other affected parties that could have a large effect on stock price movements.
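The conversion of text-mining results into numeric form and their combination with market data, as described in the problem statement, can be sketched as follows. This is an illustrative assumption about how the two sources might be joined, not the project's implementation: the function names `daily_sentiment` and `build_feature_rows`, the polarity scale, the neutral default of 0.0 for days without tweets, and all sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def daily_sentiment(scored_tweets):
    """Average tweet polarity per date; scored_tweets yields (date, polarity) pairs."""
    by_date = defaultdict(list)
    for date, polarity in scored_tweets:
        by_date[date].append(polarity)
    return {date: mean(values) for date, values in by_date.items()}

def build_feature_rows(price_rows, sentiment_by_date, default=0.0):
    """Append each day's mean sentiment to its OHLCV values.
    Trading days without tweets receive `default` (an assumed neutral score)."""
    rows = []
    for row in price_rows:
        sentiment = sentiment_by_date.get(row["date"], default)
        rows.append([row["open"], row["high"], row["low"],
                     row["close"], row["volume"], sentiment])
    return rows

# Made-up market data and tweet polarities in [-1, 1].
prices = [
    {"date": "2022-08-01", "open": 100.0, "high": 103.0, "low": 99.0,
     "close": 102.0, "volume": 1500},
    {"date": "2022-08-02", "open": 102.0, "high": 104.0, "low": 100.0,
     "close": 101.0, "volume": 1200},
]
tweets = [("2022-08-01", 0.5), ("2022-08-01", -0.25), ("2022-08-03", 0.9)]

features = build_feature_rows(prices, daily_sentiment(tweets))
for row in features:
    print(row)
```

Tweets posted on non-trading days (such as the third tweet here) are simply ignored in this sketch; carrying them forward to the next trading day would be an alternative design choice.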
2.4 INTRODUCTION
2.4.1 PURPOSE
The purpose of this SRS document is to provide a detailed overview of our software product, its parameters, and its goals. This document describes the project's target audience and its user interface, hardware, and software requirements. It defines how our audience and team will interact with the product. This project aims to determine the future movement of stock values on a financial exchange. The more accurately share price movement is predicted, the more profit investors can make. Predicting how the stock market will move is one of the most challenging problems because many factors are involved, such as interest rates, politics, and economic growth, which make the stock market volatile and very hard to predict accurately. The prediction of share prices offers huge chances for profit and is a major motivation for research in this area; knowing stock movements even a fraction of a second in advance can lead to high profits. Since stock investment is a major financial market activity, a lack of accurate knowledge and detailed information can lead to loss of investment.
2.4.1.1 INTENDED AUDIENCE AND READING SUGGESTIONS
The intended audience is small-scale investors and people who want to learn about stock market trends, as well as people who are strongly affected by sudden changes in stock prices caused by famous personalities manipulating the market.
2.4.1.2 PROJECT SCOPE
The scope of our project is to predict stock market data using different algorithms and to study their prediction efficiency. This helps companies and individuals make proper investment decisions.
2.4.2 OVERALL DESCRIPTION
2.4.2.1 PRODUCT PERSPECTIVE
There are many challenges involved in sentiment analysis.
The main problems are: the inability to perform well across different domains; inadequate accuracy and performance when sentiment analysis is based on insufficient labeled data; and the inability to deal with complex sentences that require more than sentiment words and simple analysis. Our approach requires large amounts of labeled news data for training and for correctly predicting news sentiment. This data is, however, not easy to obtain, so we use pre-trained models. BERT can be fine-tuned to perform well in specialized use cases (such as sentiment analysis of news), but its performance depends on the quality of the training data. TextBlob is a library that ships with pre-trained sentiment analyzers. It provides a consistent API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, and sentiment analysis. After testing both libraries, we found that TextBlob gave better results than BERT. However, if better labeled data is available, BERT is recommended.
BERT: BERT stands for Bidirectional Encoder Representations from Transformers. It is a Transformer-based machine learning technique for natural language processing (NLP) pre-training. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.
TextBlob: TextBlob is a Python library for natural language processing (NLP). It builds on the Natural Language Toolkit (NLTK), a library that gives easy access to many lexical resources and supports categorization, classification, and many other tasks. TextBlob is a simple library that supports complex analysis and operations on textual data. The textblob.sentiments module contains two sentiment analysis implementations: PatternAnalyzer (based on the pattern library) and NaiveBayesAnalyzer (an NLTK classifier trained on a movie review corpus).
Software Requirements:
● Operating System: Windows 7 and above, a Linux-based OS, or macOS
● Python 3.5 in Google Colab is used for data pre-processing, model training, and prediction.
2.4.2.2 User Interfaces
The user interface (UI) is the point of human-computer interaction and communication in a device. When the user opens the project in Google Colab, they can choose the date from which sentiment analysis is to be performed and the stock to be analyzed.
2.4.2.3 Hardware Interfaces
The only hardware required is a laptop that can access Google Colab.
2.4.3 OTHER NON FUNCTIONAL REQUIREMENTS
2.4.3.1 PERFORMANCE REQUIREMENTS
Usability: how simple the user interface of the stock prediction software is to understand for any kind of stock trader or other stakeholder in the stock market.
Efficiency: maintaining the highest possible accuracy in predicted closing stock prices in the shortest time with the available data.
Performance: a quality attribute of the stock prediction software that describes its responsiveness to various user interactions.
2.5 APPROVED OBJECTIVES
The objective is the successful prediction of future stock prices. In today's competitive market, predicting stock returns and a company's financial health in advance helps investors invest with confidence. Accurately predicting stock price movements allows investors to earn more.
2.6 PROJECT OUTCOMES AND DELIVERABLES
Through this project, we intend to build a model that predicts the future stock market prices of companies using the Long Short-Term Memory (LSTM) approach. Other factors that we consider for prediction are the open, close, low, and high prices and the volume of previous days.
2.7 RISK ANALYSIS
Risk analysis is a key step in identifying undesirable scenarios for which preparedness is insufficient. Based on our research in this area, we try to anticipate and mitigate the impact of as many outcomes as possible. We hope that these risks do not affect the development of the project, but it is natural not to be able to cover all aspects of the domain.
VOLATILITY: Volatility is the standard deviation of a stock's annual returns over a period, indicating the extent to which its price may rise or fall. A stock is said to be highly volatile when its price changes rapidly, making new highs and lows in a short period of time. Volatility is said to be low if the stock price moves slowly up and down or is relatively stable. Our model may give less accurate predictions when market volatility is high.
ECONOMIC VARIABLES: Other variables may influence certain correlations, such as economic variables that affect specific stocks at the macro level. News about current economic stability can indirectly affect overall markets and other variables such as volatility by influencing investors' risk appetite and market-specific or asset-specific sensitivity to downturns.
INCONSISTENT DATA: Inadequate data or a lack of tweets may also lead to inaccurate price predictions. Stock market forecasting is a major challenge because the data is non-stationary, noisy, and chaotic.
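The volatility measure mentioned in the risk analysis can be made concrete with a short worked sketch. One common convention, assumed here rather than taken from this project, is to compute the sample standard deviation of daily returns and scale it by the square root of the number of trading days in a year (252 is a typical assumption). The function names and both price series are hypothetical.

```python
import math
from statistics import stdev

def daily_returns(closes):
    """Simple returns between consecutive closing prices."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]

def annualized_volatility(closes, trading_days=252):
    """Sample standard deviation of daily returns, scaled to a yearly figure.
    `trading_days=252` is an assumed convention, not a value from the report."""
    return stdev(daily_returns(closes)) * math.sqrt(trading_days)

# Made-up closing prices: one relatively stable stock, one that swings widely.
calm = [100.0, 100.5, 100.2, 100.6, 100.4]
choppy = [100.0, 106.0, 98.0, 107.0, 97.0]

print("calm:  ", round(annualized_volatility(calm), 3))
print("choppy:", round(annualized_volatility(choppy), 3))
```

The widely swinging series produces the larger figure, which matches the description above of a highly volatile stock making new highs and lows in a short period.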
CHAPTER 3 - FLOWCHART
3.1 WORKFLOW ARCHITECTURE
Fig. 1 Workflow Diagram
3.2 ACTIVITY DIAGRAM
Fig. 2 Activity Diagram
3.3 TOOLS AND TECHNOLOGIES USED
● Google Colab
● Python
● OpenCV
● VS Code
● NLP
● Machine learning
  • 18. CHAPTER 4: PROJECT DESCRIPTION 4.1 LIBRARIES AND LANGUAGES USED 4.1.1. PYTHON Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Python is dynamically typed and garbage collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented, and functional programming. Due to its comprehensive standard library, Python is often described as a "batteries included" language. Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system with reference counting. 4.1.2 MACHINE LEARNING Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. Matplotlib is used for plotting graphs. Scikit-learn is probably the most useful library for machine learning in Python. The sklearn library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. 4.1.3 NLP Natural language processing (NLP) refers to the field of computer science, more specifically artificial intelligence (AI), which deals with giving computers the ability to understand texts and spoken language in the same way as humans. NLP combines computational linguistics (rule-based modeling of human language) with models of statistics, machine learning, and deep learning. 
Combining these technologies, computers can process human language in the form of text or audio data and "understand" its full meaning, including the intent and sentiment of the speaker or writer. NLP drives computer programs that translate text from one language to another, respond to voice commands, and quickly summarize large amounts of text, even in real time. Most of us have interacted with NLP in the form of voice-controlled GPS systems, digital assistants, voice recognition dictation software, customer service chatbots, and other consumer conveniences. NLP also plays a growing role in enterprise solutions that help streamline business operations, increase employee productivity, and simplify mission-critical business processes.
4.1.4 SENTIMENT ANALYSIS
Data analysts use sentiment analysis to extract information for market research and to monitor brand and product reputation. The technique is also very helpful for learning what customers think and acting on it to improve the customer experience. In addition, companies involved in data analysis typically integrate third-party sentiment analysis APIs into their infrastructure to extract useful insights and make them available to their customers. This section outlines the role of NLP and machine learning techniques in how sentiment analysis works.
4.1.5 LSTM
Long short-term memory networks, commonly known as LSTMs, are a special type of recurrent neural network that can learn and predict long sequences. In contrast to regular feedforward neural networks, LSTMs have feedback connections, so they can process entire sequences of data, not just individual data points. By default, LSTMs retain information over long periods of time. An additional benefit of LSTMs when learning long sequences is that they can learn to make one-shot multi-step predictions, which is very useful for time series forecasting. An LSTM unit consists of a cell, an input gate, an output gate, and a forget gate. The cell holds values over a period of time, and the gates control the flow of information into and out of the cell.
Fig. 3 LSTM Diagram
The repeating module in an LSTM contains four interacting layers. The compact forms of the equations for the forward pass of an LSTM unit are:
\[
\begin{aligned}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \sigma_c(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \sigma_h(c_t)
\end{aligned}
\]
where the initial values are $c_0 = 0$ and $h_0 = 0$, and the operator $\odot$ denotes the element-wise product. The subscript $t$ indexes the time step. Here the variables are:
● $x_t$: input vector to the LSTM unit
● $f_t$, $i_t$, $o_t$: activation vectors of the forget, input, and output gates
● $h_t$: hidden state (output) vector of the LSTM unit
● $\tilde{c}_t$: cell input activation vector
● $c_t$: cell state vector
● $W$, $U$, $b$: weight matrices and bias vectors learned during training
Activation functions:
● $\sigma_g$: sigmoid function
● $\sigma_c$: hyperbolic tangent function
● $\sigma_h$: hyperbolic tangent function
4.2 TOOLS
Sentiment analysis presents many challenges. The main problems are that it does not work well across different domains, that its accuracy and performance are inadequate when it is based on poorly labeled data, and that it cannot handle complex sentences that require more than sentiment words and simple analysis. Our approach requires a large amount of labeled news data to train on and to predict news sentiment accurately, but such data is not easy to obtain. For this reason, we use pre-trained models. BERT can be fine-tuned to work well for specific use cases (such as news sentiment analysis), but its performance depends on the quality of the training data. TextBlob is a library that ships with pre-trained sentiment analyzers. It provides a consistent API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, and sentiment analysis. After testing both libraries, we found that TextBlob gave better results than BERT. However, if better labeled data is available, we recommend using BERT.
BERT: BERT stands for Bidirectional Encoder Representations from Transformers. It is a Transformer-based machine learning technique developed by Google for pre-training in natural language processing (NLP). It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. This allows the pre-trained BERT model to be fine-tuned with just one additional output layer to create state-of-the-art models for a variety of NLP tasks.
Fig. 4 BERT Diagram
TextBlob: TextBlob is a Python library for natural language processing (NLP). It builds on the Natural Language Toolkit (NLTK), a library that provides easy access to many lexical resources and supports classification and many other tasks. TextBlob is a lightweight library that supports complex analysis and manipulation of text data. The textblob.sentiments module contains two sentiment analysis implementations: PatternAnalyzer (based on the pattern library) and NaiveBayesAnalyzer (an NLTK classifier trained on a movie review corpus).
4.3 PROCEDURE
Step 1 : Data collection
Tweets about Microsoft, Google, and Apple are extracted using the Twitter API and filtered by keywords such as $MSFT, #Microsoft, and #Windows. The tweets capture not only public opinion about the company's stock but also public opinion about the products and services that it provides. The filtering terms have been chosen carefully so that the extracted tweets reflect the sentiment of the general public towards Microsoft over a specific time period. Twitter news about Microsoft and tweets about product releases can also be included. The opening and closing prices for Microsoft stock are taken from Yahoo! Finance.
Step 2 : Data Pre-Processing
The collected stock price data is incomplete because the stock market is closed on weekends and public holidays. The missing data is approximated using a simple technique. Stock data is assumed to follow a concave function, so if the stock value on a given day is x and the next available value is y, with some values missing in between, the first missing value is approximated as (y+x)/2, and the same method is followed to fill all the gaps. Tweets contain many acronyms, emoticons, and unnecessary data such as pictures and URLs, so they are pre-processed to represent the correct emotions of the public. For pre-processing tweets, we employ three stages of filtering: tokenization, stop word removal, and regex matching for removing special characters.
● Tokenization: Tweets are split into individual words at spaces, and irrelevant symbols such as emoticons are removed. A list of individual words is formed for each tweet.
● Stop word removal: Words that do not express any emotion are called "stop words." After a tweet is split, words such as a, is, the, and with are removed from its list of words.
● Regex matching for special character removal: Regex matching in Python is used to match URLs, which are replaced by the term URL.
Step 3 : Sentiment Analysis
Sentiment analysis tasks are very much field specific. Tweets are classified as positive, negative, or neutral based on the sentiment present. A subset of the tweets is examined by humans and annotated as 1 for positive, 0 for neutral, and 2 for negative emotions.
For the classification of the tweets that were not annotated by humans, a machine learning model is trained on features extracted from the human-annotated tweets.

Step 4 : Feature Extraction
Textual representation can be done using n-grams.
N-gram Representation: N-gram representation is known for its specificity to the corpus of text being studied. In this technique, the full corpus of related text (the tweets, in the present work) is parsed, and every word sequence of length n that appears in it is extracted to form a dictionary of words and phrases. For example, the text “Microsoft is launching a new product” has the following 3-gram word features: “Microsoft is launching,” “is launching a,” “launching a new,” and “a new product.” In our case, the N-grams of all the tweets form the corpus.
  • 23. In this representation, each tweet is split into N-grams, and the features of the model are a string of 1s and 0s, where a 1 indicates that an N-gram from the corpus is present in the tweet and a 0 indicates its absence.

Step 5 : Model Training
The features extracted from the tweets using the above methods are fed to a classifier, which is trained using classification methods such as Logistic Regression, Decision Tree, SVM, and KNN to estimate the direction of the change in stock market price from the volume and sentiment of news articles and tweets. Linear Regression is also applied to find the relation between the change in stock market price and the volume and sentiment of news articles and tweets.
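The N-gram representation of Step 4 can be illustrated with a short Python sketch. It is a simplified example under stated assumptions rather than the project's implementation: the helper names (word_ngrams, build_vocabulary, binary_features) are hypothetical, and the second tweet in the toy corpus is invented for the demonstration.

```python
from typing import Dict, List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every contiguous word sequence of length n."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(corpus: List[str], n: int) -> Dict[str, int]:
    """Map each distinct N-gram in the corpus to a feature index."""
    vocabulary: Dict[str, int] = {}
    for tweet in corpus:
        for gram in word_ngrams(tweet.split(), n):
            vocabulary.setdefault(gram, len(vocabulary))
    return vocabulary


def binary_features(tweet: str, vocabulary: Dict[str, int], n: int) -> List[int]:
    """Return a 0/1 vector: 1 where a corpus N-gram occurs in the tweet."""
    vector = [0] * len(vocabulary)
    for gram in word_ngrams(tweet.split(), n):
        index = vocabulary.get(gram)
        if index is not None:
            vector[index] = 1
    return vector


# The first tweet is the example from the text; the second is made up.
corpus = [
    "Microsoft is launching a new product",
    "Microsoft is launching a new console",
]
vocabulary = build_vocabulary(corpus, n=3)
print(word_ngrams(corpus[0].split(), 3))
print(binary_features(corpus[1], vocabulary, n=3))
```

Each tweet becomes a fixed-length vector over the corpus vocabulary, which is the form of input that classifiers such as those listed in Step 5 expect.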
  • 24. Chapter 5 - Implementation and Experimental Results
5.1 Sample Code
  • 25 to 35. [Sample code images; no text captured in the transcript]
  • 36. 5.2 Output and Accuracy
237/237 [==============================] - 258s 1s/step - loss: 0.7036 - accuracy: 0.5013 - val_loss: 0.6978 - val_accuracy: 0.4500
Epoch 2/4
237/237 [==============================] - 256s 1s/step - loss: 0.6975 - accuracy: 0.5167 - val_loss: 0.6992 - val_accuracy: 0.5700
Epoch 3/4
237/237 [==============================] - 255s 1s/step - loss: 0.6902 - accuracy: 0.5384 - val_loss: 0.6859 - val_accuracy: 0.6000
Epoch 4/4
237/237 [==============================] - 256s 1s/step - loss: 0.6392 - accuracy: 0.6490 - val_loss: 0.7409 - val_accuracy: 0.5000
  • 37. We attain a training accuracy of about 65% (0.6490) in predicting the stock value in the final epoch. The validation accuracy in that epoch is 50%, after reaching 60% in epoch 3.
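One way to read the training log above is to compare training and validation metrics per epoch. The following Python sketch copies the four result lines from the log by hand; it assumes the first result line belongs to epoch 1, and the variable names are chosen for illustration only.

```python
# Per-epoch metrics transcribed from the training log above.
# Assumption: the first result line corresponds to epoch 1 of 4.
history = [
    {"epoch": 1, "accuracy": 0.5013, "val_loss": 0.6978, "val_accuracy": 0.4500},
    {"epoch": 2, "accuracy": 0.5167, "val_loss": 0.6992, "val_accuracy": 0.5700},
    {"epoch": 3, "accuracy": 0.5384, "val_loss": 0.6859, "val_accuracy": 0.6000},
    {"epoch": 4, "accuracy": 0.6490, "val_loss": 0.7409, "val_accuracy": 0.5000},
]

# Epoch with the lowest validation loss.
best = min(history, key=lambda row: row["val_loss"])

# Gap between training and validation accuracy in the final epoch.
final = history[-1]
gap = final["accuracy"] - final["val_accuracy"]

print("Best epoch by val_loss:", best["epoch"])
print("Final-epoch accuracy gap:", round(gap, 4))
```

A training accuracy that rises while the validation loss also rises is commonly taken as a sign of overfitting, so selecting the epoch by validation loss may give a more representative figure than the final training accuracy.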
  • 38. CHAPTER 6 - Outcome and Prospective learning
6.1 OUTCOMES
Our project focuses mainly on increasing productivity and resource utilization. Through this project, we aim to create a Long Short-Term Memory (LSTM) model that can forecast the future stock market prices of corporations. The open price, close price, low, high, and volume from prior days are further inputs that we take into account when making a prediction.
6.2 FUTURE SCOPES
● Enhance reliability and user experience by improving the GUI.
● Add other variables that influence stock market forecasting. Increasing the number of input parameters is expected to improve the estimates.
6.3 PROSPECTIVE LEARNING
The learning outcomes for the Capstone project are as follows:
● Developing new/multidisciplinary technical skills
● Using professional and technical terminology appropriately
● Effectively utilizing and troubleshooting a tool for the development of a technical solution
● Analyzing data to create information
● Creating a technical report with the usage of international standards
● Acquiring and evaluating information
6.4 CONCLUSION
The dataset we used to build machine learning models for stock market price prediction proved effective. We applied feature selection and data pre-processing to the dataset. Our machine learning model uses the LSTM method. Furthermore, we identified and extracted subjective material from user views, judgments, sentiments, attitudes, and emotions using text mining and sentiment analysis, both natural language processing (NLP) techniques.
  • 39. CHAPTER 7 – PROJECT TIMELINE
7.1 GANTT CHART
Fig. 5 Gantt Chart
7.2 PROJECT TIMELINE
Month        Work done / Expected to be done
Feb          Project planning and discussion with mentors
March        Study from research papers, study of software requirements
April        Finalizing design flow and studying LSTM and NLP algorithms
May          Research on various social media APIs for sentiment analysis
June         Sentiment analysis of Twitter data
July         Applying different NLP algorithms to get the most accurate results
August       Applying BERT and TextBlob to predictions on a single stock
September    Testing the accuracy with different stocks for different time durations
October      Analyzing the results provided by our model
November     Documentation and finalization of the project
  • 40. REFERENCES
● http://cse.anits.edu.in/projects/projects2021C9.pdf
● https://www.leadingindia.ai/downloads/projects/SMA/sma_7.pdf
● https://www.tandfonline.com/doi/full/10.1080/09540091.2021.1940101?cookieSet=1
● https://arxiv.org/ftp/arxiv/papers/1607/1607.01958.pdf
● https://ieeexplore.ieee.org/document/8848203
● https://colah.github.io/posts/2015-08-Understanding-LSTMs/
● https://www.sciencedirect.com/science/article/pii/S157401371930084X
● https://www.researchgate.net/publication/328930285_Stock_Market_Prediction_Using_Machine_Learning
● https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959635/