Text Analytics- An application in Indian Stock Markets


This presentation was created to present the project done as part of the Applied Management Research Project at the Vinod Gupta School of Management, IIT Kharagpur.


  1. Vinod Gupta School of Management, IIT Kharagpur
     Text Analytics: An Application in Indian Stock Market
     Applied Management Research Project, 2014
     By Sinjana Ghosh
     Done under the able guidance of Prof. A. K. Misra
  2. Background: Motivation behind this project
  3. Algorithmic Trading in India
     • Involves the use of algorithms in pre-built platforms to place electronic trades on stocks, futures, options, currencies and commodities on exchanges, without any human intervention
     • In 2008, India allowed the first Direct Market Access (DMA) and algorithmic trades to go through
     • The most commonly used algorithmic trading strategies in India include arbitrage, market making and trend following
  4. Big Data
     • Data is available in various forms: not just structured, but also semi-structured (such as XML and EDI documents) and unstructured (such as text and multimedia)
     • Big Data analytics is the strategy of using the huge amount of data now accessible through the internet, mobile messages and various other platforms to extract useful information that can be further analyzed to support decision making
  5. Text Data Analytics
     • Subset of Big Data analytics that involves extracting entities such as persons, locations and organizations from text, together with the relationships between the extracted entities, and analysing them for business needs
     Predictive Analytics
     • Involves searching for meaningful relationships among variables and representing those relationships in models
     • Response variables and explanatory variables
     • Two common types of model: regression and classification
  6. Sentiment Analysis
     • Use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials
     • Aims to determine the attitude of a speaker or writer with respect to some topic, or the overall contextual polarity of a document
     Machine Learning
     • A branch of artificial intelligence concerned with the construction and study of systems that can learn from data
  7. The Problem
     Using text mining of news articles available in the public domain to analyse market sentiment and correlate it with the actual movement in the Nifty 50
  8. Objective
     • Use textual news from a range of online sources to perform data mining, checking each article for occurrences of a basic set of keywords.
     • Train a machine learning algorithm to predict the impact of the most viewed news articles on market sentiment, and hence the movement of the market, represented in the study by the Nifty 50.
     • Validate the results obtained on the training set using a set of recent news articles (the test set) to check for errors and measure the level of accuracy.
  9. Methodology
     • Textual Representation
        • Bag of words
        • Noun Phrasing
        • Named Entities
        • Named Entities with context-capturing feature
     • Predictive Modelling Approach
     Source: Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R (Mill)
  10. Methodology
      • Sources of textual data
  11. Methodology
      • Partitioning data in machine learning
      Source: Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R (Mill)
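The partitioning figure itself is not reproduced in this transcript. As a rough illustration only, the sketch below splits a list of articles into a training set and a test set of more recent articles, which matches the objective of validating on recent news. The record layout, the function name `chronological_split` and the cutoff date are assumptions made for this example, not details taken from the project.

```python
from datetime import date


def chronological_split(articles, cutoff):
    """Split articles into (training, test) by publication date.

    Articles dated before `cutoff` go to the training set; articles dated
    on or after it form the test set of more recent news.
    """
    training = [a for a in articles if a["date"] < cutoff]
    test = [a for a in articles if a["date"] >= cutoff]
    return training, test


# Hypothetical records; the project's actual articles are not listed in the slides.
articles = [
    {"date": date(2013, 8, 16), "title": "Sample article A"},
    {"date": date(2013, 9, 19), "title": "Sample article B"},
    {"date": date(2014, 3, 24), "title": "Sample article C"},
]

# The cutoff date is an assumption chosen for illustration.
training_set, test_set = chronological_split(articles, cutoff=date(2014, 3, 1))
```

Splitting by date rather than at random keeps the test articles strictly later than the training articles, so the check resembles forecasting on unseen news.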
  12. Text Analysis Algorithm
      1. Convert all characters to lowercase.
      2. Remove stop-words that do not help in sentiment analysis, such as “is”, “are”, “if”, “when”, “where”, “then”, “their”, “there”, “why”, “which”, “how”.
      After this, the following is done:
      1. Create an array of significant named entities such as “inflation”, “gdp”, “sensex” etc.
      2. Run the script, which extracts each named entity occurring in the article along with the 2 words immediately preceding and the 3 words immediately succeeding it. This captures not only the keywords but also their context.
      3. Train the algorithm by assigning a weight to each keyword so that the sentiment score most closely reflects the actual returns of the day.
  13. Text Analysis Algorithm
      4. A set of qualifiers is defined and matched against the preceding and succeeding words captured as the “context” of each extracted keyword. The algorithm then assigns a weight (-1 for negative, 0 for neutral and +1 for positive) to each extracted qualifier.
      5. The sum product of the qualifier weights and keyword weights gives the sentiment score of the article, from which the returns of the day due to that news can be predicted.
      6. The importance score is simply the sum of the weights of the individual keyword occurrences in the article. However, whether the effect will be positive or negative, and how strongly the market will react, is determined only by the sentiment score.
      7. Regression is performed on the scores versus actual returns for the training set, giving a formula for converting scores into forecasted returns.
      8. This formula is tested on the validation set and the errors are calculated.
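The scoring steps on slides 12 and 13 can be sketched roughly as follows. This is a minimal Python illustration, not the author's R script. The keyword weights, the qualifier words and all function names are hypothetical, since the slides do not publish the trained lexicon. Where several qualifiers fall inside one context window, this sketch simply sums their weights, which is also an assumption.

```python
import re

# Stop-words listed on slide 12.
STOP_WORDS = {"is", "are", "if", "when", "where", "then",
              "their", "there", "why", "which", "how"}

# Hypothetical weights for illustration; the trained values are not in the slides.
KEYWORD_WEIGHTS = {"rbi": 10, "rupee": 8, "inflation": 8, "gdp": 6}
QUALIFIER_WEIGHTS = {"gains": 1, "rises": 1, "falls": -1, "weakens": -1}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop-words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def extract_contexts(tokens, before=2, after=3):
    """Yield (keyword, context) pairs for every lexicon keyword found.

    The context is the 2 preceding and 3 succeeding tokens, as on slide 12.
    """
    for i, token in enumerate(tokens):
        if token in KEYWORD_WEIGHTS:
            context = tokens[max(0, i - before):i] + tokens[i + 1:i + 1 + after]
            yield token, context


def score_article(text):
    """Return (importance_score, sentiment_score) for one article."""
    importance = 0
    sentiment = 0
    for keyword, context in extract_contexts(tokenize(text)):
        weight = KEYWORD_WEIGHTS[keyword]
        # Importance ignores context: sum of keyword weights per occurrence.
        importance += weight
        # Assumption: qualifier weights within one context window are summed.
        qualifier = sum(QUALIFIER_WEIGHTS.get(word, 0) for word in context)
        sentiment += weight * qualifier
    return importance, sentiment
```

With these illustrative weights, `score_article("The rupee weakens sharply when GDP growth falls")` returns `(14, -20)`: two keyword occurrences contribute 8 + 6 to importance, and the negative qualifiers in their context windows push the sentiment score below zero.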
  14. Training of algorithm
      • Training set: daily returns of 2013-14 with returns > 1% or returns < -1%
      • Several iterations were run, with regression performed at each level, to finalize the set of keywords in the lexicon, the weight of each keyword, the set of qualifiers and their scores, and the set of exceptional items in the lexicon
      • The iterations started with 50 articles and ended with 125 articles
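As a hedged sketch of the training loop's core, the code below filters days by the 1% return threshold, fits a straight line of returns against sentiment scores by ordinary least squares, and measures the error on held-out data. The slides say only that "regression" was used, so a simple one-variable linear fit is an assumption here, as are the function names, the record layout and the sample numbers.

```python
def select_large_moves(days, threshold_pct=1.0):
    """Keep only days whose absolute return exceeds the threshold (in percent)."""
    return [d for d in days if abs(d["return_pct"]) > threshold_pct]


def fit_line(scores, returns):
    """Ordinary least squares fit of: return = intercept + slope * score."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(returns) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, returns))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_absolute_error(intercept, slope, scores, returns):
    """Average absolute gap between forecasted and actual returns."""
    errors = [abs(intercept + slope * x - y) for x, y in zip(scores, returns)]
    return sum(errors) / len(errors)


# Made-up sample data for illustration only; not the project's results.
days = [
    {"score": -20, "return_pct": -1.9},
    {"score": 5, "return_pct": 0.3},
    {"score": 15, "return_pct": 1.4},
    {"score": -12, "return_pct": -1.1},
    {"score": 22, "return_pct": 2.1},
]

training = select_large_moves(days)
intercept, slope = fit_line(
    [d["score"] for d in training],
    [d["return_pct"] for d in training],
)
```

In the iterative procedure described on the slide, a fit like this would be repeated each time the lexicon weights or qualifiers are adjusted, and the error on a separate test set would be computed with `mean_absolute_error`.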
  15. Analysis and Results
      125 news articles in the training set were analyzed using the script in R, and the following were extracted:
      • All the named entities occurring in the news article that match the lexicon
      • The context in which they appear, captured by extracting the words preceding and succeeding each named entity
  16. Interesting observations
      • The number of keywords a news article contains has much less bearing on its effect on the market than the context in which the keywords appear. Based simply on the occurrence of keywords, 35 news articles got an importance score greater than 80, but when the sentiment score was calculated most of the context led to neutral scoring (0), giving a low sentiment score that suggests low returns (on both the positive and the negative side)
      • The keywords assigned the highest weights while training the algorithm are:
         • RBI
         • Rupee
         • Inflation
         • GDP
  17. Interesting observations
      • Names of specific indices or industries, or results of specific companies containing terms like “quarterly”, “results”, “annual”, “profit”, “revenue” etc., are least useful in evaluating the sentiment of the overall market represented by the Nifty
      • When gold prices came down drastically, markets in most nations fell as gold mutual funds incurred huge losses. In India, however, broad indices outperformed on the same event, which suggests that the prices of precious metals have an inverse effect on the Indian stock market as a whole. Gold has therefore also been included in the list of exceptional items in the lexicon.
  18. Prediction Accuracy
      • Summary of training set results:
  19. Prediction Accuracy
      • Line fit plot for training set:
      • Line fit plot for test set:
  20. An Example from test set
      • March 24, 2014
  21. An Example from test set
  22. Dataset and analysis
  23. Workspace showing the list of keywords
  24. Conclusion and scope of further work
  25. Conclusion
      • The algorithm used in the study, along with the weights given to the terms in the lexicon and to the qualifiers, is able to predict daily market returns effectively when the daily return is at least 1% in magnitude (positive or negative)
      • The Indian stock market does react to systemically important news articles
      • Textual analysis of publicly available news articles has significant predictive quality
      • As the efficiency of the Indian market increases, arbitrage opportunities will shrink, so algorithmic traders will have a significant advantage over manual traders if text analytics is implemented in algorithmic trading
  26. Scope of further work
      • News articles can be clustered or classified into “economic news”, “political news” and “other news” based on the frequency of specific named entities, to find out which type of news has the greatest impact on the Indian market
      • If minute-wise market returns are available, news articles can be collected every hour and the returns observed over a period to find how long a news article of a given importance score takes to affect the market
      • This text mining algorithm is not fully automated. The news articles need to be fed manually into the program for it to run and predict the returns. However, the process can be automated to obtain a live news feed from websites and automatically compute importance and sentiment scores. If the score is higher or lower than a particular range, BUY or SELL (or short sell) calls can be taken automatically by the machine.
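The last point could look something like the sketch below, which maps a sentiment score to a call once it leaves a neutral band. The thresholds, the function name and the "HOLD" outcome for scores inside the band are assumptions for illustration; the slides do not specify a range or what happens within it.

```python
def trading_signal(sentiment_score, buy_above=15, sell_below=-15):
    """Map a sentiment score to a call.

    The thresholds are hypothetical; a real system would calibrate them
    against the regression of scores on returns.
    """
    if sentiment_score > buy_above:
        return "BUY"
    if sentiment_score < sell_below:
        return "SELL"
    # Assumption: scores inside the band trigger no trade.
    return "HOLD"


# Example with made-up scores.
signals = [trading_signal(score) for score in (22, -20, 3)]
```

Keeping a neutral band between the two thresholds avoids trading on articles whose context scores are close to zero, which the study found to be the common case.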
  27. Thank you! Questions?