This document outlines a project that uses natural language processing and machine learning techniques to analyze media headlines about IPOs and predict the trajectory of the stock price over the first three months. The project scraped over 5,000 headlines about 16 IPOs from various sources, performed sentiment analysis and feature selection, and tested several machine learning models to select a linear support vector machine model that best predicts positive or negative price movements. The document describes the methodology, data sources and wrangling processes, modeling approach, and concludes with plans for a product demonstration and areas for further improvement.
4. Introduction
• Intro to IPOs
• Hypothesis Statement
• Dataset
• Architecture
Project Hypothesis: Sentiment analysis of media headlines about an IPO can be used to predict the trajectory of the stock price over the first three months.
7. Methodology & Data Product
• Data scraping & wrangling
• Headline sentiment, NLP
• Feature selection
• Modeling and prediction
3 Data Sources:
• Dow Jones Factiva
• www.iposcoop.com
• Morningstar API
• All three sources merged into one dataframe via inner join
• Normalized prior to subsequent processing
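The merge-and-normalize step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy frames, the column names (`ticker`, `offer_price`, `close_day_90`), and the choice of min-max scaling are all assumptions, since the slides name only the three sources, the inner join, and an unspecified normalization.

```python
import pandas as pd

# Hypothetical toy frames standing in for the three sources
# (headlines, IPO metadata, price data). All column names are assumptions.
headlines = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "CCC"],
    "headline": [
        "AAA soars on debut",
        "AAA prices IPO above range",
        "BBB slips after listing",
        "CCC files to go public",
    ],
})
ipo_info = pd.DataFrame({"ticker": ["AAA", "BBB"], "offer_price": [20.0, 15.0]})
prices = pd.DataFrame({"ticker": ["AAA", "BBB"], "close_day_90": [26.0, 12.0]})

# Inner joins keep only tickers present in every source, so CCC is dropped.
merged = (
    headlines
    .merge(ipo_info, on="ticker", how="inner")
    .merge(prices, on="ticker", how="inner")
)

# Min-max scaling of the numeric columns to [0, 1]. The slides do not say
# which normalization was used; this choice is an assumption.
numeric_cols = ["offer_price", "close_day_90"]
col_min = merged[numeric_cols].min()
col_max = merged[numeric_cols].max()
merged[numeric_cols] = (merged[numeric_cols] - col_min) / (col_max - col_min)
```

An inner join silently discards any IPO that is missing from one source, which is worth checking when the sample is as small as 16 IPOs.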
Sentiment Analysis:
Empath (built-in lexicon) vs. OpinionFinder (built-in lexicon)
● Included both in model
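To illustrate what lexicon-based headline scoring looks like in general, here is a small self-contained sketch. It does not call Empath or OpinionFinder; the word lists and the `lexicon_sentiment` function are hypothetical stand-ins for the built-in lexicons those tools provide.

```python
import re

# Hypothetical toy lexicons; real tools ship far larger built-in word lists.
POSITIVE = {"soars", "surges", "gains", "strong", "beats"}
NEGATIVE = {"slips", "falls", "weak", "plunges", "misses"}


def lexicon_sentiment(headline: str) -> float:
    """Return (positive hits - negative hits) / token count, in [-1, 1]."""
    tokens = re.findall(r"[a-z']+", headline.lower())
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)


scores = [
    lexicon_sentiment("Stock soars on strong debut"),
    lexicon_sentiment("Stock plunges after weak debut"),
]
```

Scores from two different lexicons can be kept as separate feature columns, which is one way to read "included both in model".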
Feature Selection:
● Early results were poor (32 – 52 intuitive features)
● Applied CountVectorizer to headline text → 4K features
● Principal Component Analysis (k=0.95) → 2K features
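The two feature-selection steps above might look roughly like this in scikit-learn. This is a sketch under assumptions: the example headlines are invented, and "k=0.95" is read here as the fraction of variance to retain (`n_components=0.95`), which the slides do not confirm.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Invented example headlines; the project used roughly 5,000 scraped ones.
headlines = [
    "Tech unicorn soars in market debut",
    "IPO prices below expected range",
    "Shares slip after lukewarm debut",
    "Investors cheer strong first day of trading",
    "Analysts warn of weak demand for IPO",
    "Stock climbs as demand beats expectations",
]

# Bag-of-words counts: one column per distinct token in the corpus.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(headlines)

# A float n_components needs a dense array, so the sparse counts are
# converted before fitting.
dense_counts = counts.toarray()

# Assumption: keep as many components as needed to explain 95% of variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(dense_counts)
```

With a float `n_components`, the number of retained columns is chosen by the data, which is consistent with the roughly 4K to 2K reduction reported on the slide.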
Models Tested:
• LinearSVC (Support Vector Machine)
• NuSVC
• SVC
• KNeighborsClassifier
• SGDClassifier (Stochastic Gradient Descent)
• LogisticRegression
• LogisticRegressionCV
• BaggingClassifier
• ExtraTreesClassifier
• RandomForestClassifier
• MultinomialNB (Naive Bayes)
Model Selected: LinearSVC
• Best at predicting price trajectory (both positive and negative) over 90-day period
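A comparison across several of these classifiers could be run along the following lines. The slides do not describe the evaluation procedure, so the 5-fold cross-validated accuracy, the synthetic data from `make_classification`, and the hyperparameters shown are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

# Synthetic binary data standing in for the PCA-reduced headline features
# and the positive/negative 90-day price label.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# A subset of the models listed above; hyperparameters are illustrative.
candidates = {
    "LinearSVC": LinearSVC(max_iter=10000, random_state=0),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SGDClassifier": SGDClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
```

On real data with only 16 IPOs, folds would probably need to be grouped by ticker so that headlines about the same IPO do not appear in both the training and test splits.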