AFP_Group19_final
1. TEXT ANALYTICS OF NEWS FOR THE TRADING FLOOR
Group 19:
Debarshi Basu, Siddhartha Gupta, Yiyang Fan
Under the guidance of
2. CONTENT
• Motivation
• Data Collection from RSS feeds
• Torturing Data, a.k.a. Data Mining Methodologies
• Identifying freshness
• Identifying impact
• Some Awesome Results
• Conclusion
• Acknowledgements
• Q&A
3. OBJECTIVE: PROVIDE INTERFACE TO RELEVANT NEWS
o News moves prices!
o A trader needs fast access to relevant news:
  o News regarding the asset of interest
  o Fresh news
  o News that impacts prices the most
4. MOTIVATION: CURRENT INEFFICIENCY
o Current inefficiency:
  o Existing sources only allow keyword search
    o E.g. a search for AAPL misses news on new iPhone specs
  o Existing keyword search doesn't differentiate between kinds of news
    o E.g. a keyword search for AAPL doesn't distinguish an Apple quarterly earnings release from the launch of new products
[Figure panels: first 3 news items; next 3 news items]
5. DATA SOURCE: RSS FEEDS
• Archived historical data from Bloomberg was unavailable
• We collected Rich Site Summary (RSS) feeds instead
  o A standard for communicating information updates to subscribers
  o An XML-based format, compatible with multiple platforms
• RSS feed sources:
  o CNBC Top News
  o CNBC Business
  o CNBC Economy
  o CNBC Finance
  o Bloomberg
  o Reuters Money
  o Reuters Business
  o Reuters Company News
  o Financial Times Market
  o Financial Times US Market
  o WSJ Business
  o WSJ Markets
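Because RSS is plain XML, headlines can be pulled out with a standard XML parser. The sketch below is a minimal illustration, not the project's actual collector: the sample feed text and the function name `parse_items` are made up for the example, and it assumes each `<item>` carries a `<title>` and an RFC 822 `<pubDate>`, as in RSS 2.0.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed content; a real collector would download this text
# from one of the feed URLs.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Business Feed</title>
    <item>
      <title>Oil prices rise as supply tightens</title>
      <pubDate>Tue, 17 Feb 2015 20:14:00 GMT</pubDate>
    </item>
    <item>
      <title>Tech shares slip after earnings</title>
      <pubDate>Tue, 17 Feb 2015 20:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return (unix_timestamp, headline) pairs for every <item> in the feed."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.iter("item"):
        headline = item.findtext("title", default="").strip()
        published = parsedate_to_datetime(item.findtext("pubDate"))
        rows.append((int(published.timestamp()), headline))
    return rows


rows = parse_items(SAMPLE_RSS)
for time_stamp, headline in rows:
    print(time_stamp, headline)
```

Each pair has the same shape as a (time stamp, headline) record, so it could be stored directly in a two-column table.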
6. DATA: COLLECTION
• Time horizon:
  o January 27th, 2015 to March 4th, 2015
• Sample size:
  o 7,127 news headlines (up to March 4th, 2015)
• SQLite database:

  Time Stamp   Headline
  [INTEGER]    [TEXT]
  ……           ……
  1424204040   With fixed-income yields at record lows, a senior broker has told CNBC that now is the perfect time for investors to sell and move into equities.
  1424205000   Here's how to stop overspending, undersaving and racking up credit card debt.
  1424205180   Here's what will happen to the market and individual stocks when underperforming hedge funds are forced to chase this rally.
  ……           ……
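A two-column table like the one above can be reproduced with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table name `news` and the column names are hypothetical, and the integer time stamps are assumed to be Unix epoch seconds (which is consistent with the collection window).

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database for illustration; the real store would be a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (time_stamp INTEGER, headline TEXT)")

sample_rows = [
    (1424205000, "Here's how to stop overspending, undersaving and racking up credit card debt."),
    (1424204040, "With fixed-income yields at record lows, a senior broker has told CNBC "
                 "that now is the perfect time for investors to sell and move into equities."),
]
conn.executemany("INSERT INTO news VALUES (?, ?)", sample_rows)

# Read the earliest headline back and decode its time stamp (assumed epoch seconds).
first = conn.execute(
    "SELECT time_stamp, headline FROM news ORDER BY time_stamp LIMIT 1"
).fetchone()
first_time = datetime.fromtimestamp(first[0], tz=timezone.utc)
print(first_time.isoformat(), first[1][:40])
```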
10. Fresh News vs. Stale News
o Definition: a fresh news item is one that contains information not contained in any previous news item.
o Classification can be done using a Support Vector Machine (SVM)
o Supervised learning: data are assigned labels
  o News that arrived in the last 2 days: label +1
  o News that arrived earlier: label -1
11. BINARY CLASSIFICATION OF NEWS
o Used an SVM to classify headlines based on the label assigned to each (new = +1, old = -1)
o The SVM maximizes the margin, i.e. the distance between the two parallel hyperplanes that bound the fresh and the stale headlines
o Trained on 75% of the data, validated on 25%
o Used a Gaussian (Radial Basis Function) kernel
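The setup above (labels of +1/-1, a 75%/25% split, an RBF kernel) can be sketched with scikit-learn as follows. The toy headlines and labels are invented, and the TF-IDF features are an assumption: the slides do not say how headlines were turned into vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Invented toy data: +1 = fresh (arrived in the last 2 days), -1 = stale.
headlines = [
    "Oil jumps after surprise OPEC output cut",
    "Bank unveils new strategy at investor day",
    "Central bank announces unexpected rate move",
    "Hedge fund takes new stake in shale producer",
    "Oil prices steady as traders await data",
    "Oil prices little changed ahead of inventory data",
    "Markets flat as investors await earnings",
    "Stocks flat as traders await earnings reports",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

# 75% training, 25% validation, keeping both classes in each part.
train_text, val_text, y_train, y_val = train_test_split(
    headlines, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vocabulary on the training part only (feature choice is an assumption).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_val = vectorizer.transform(val_text)

clf = SVC(kernel="rbf", gamma="scale")  # Gaussian (RBF) kernel
clf.fit(X_train, y_train)

predictions = clf.predict(X_val)
scores = clf.decision_function(X_val)  # signed distance, usable for ranking
print(list(zip(val_text, predictions)))
```

With only eight headlines the predictions are not meaningful; the point is the shape of the pipeline.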
12. Area Under the Curve (AUC)
o 70-80% over time
o Shows good performance
Plot based on:
o 4,150 headlines over 20 days
o 1,037 out-of-sample headlines
o Written in Python using scikit-learn
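For intuition about the metric, here is a small pure-Python sketch. It assumes the curve in question is the ROC curve, in which case the area equals the probability that a randomly chosen fresh headline receives a higher classifier score than a randomly chosen stale one. The labels and scores are invented.

```python
def auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels are +1 (fresh) or -1 (stale); ties count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == -1]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [1, 1, 1, -1, -1]
toy_scores = [0.9, 0.4, 0.7, 0.5, 0.1]
toy_auc = auc(toy_labels, toy_scores)
print(round(toy_auc, 3))  # 5 of the 6 positive/negative pairs are ordered correctly
```

Under this reading, a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.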
13. News articles released on the same day
o News about oil contains words that had been published in earlier news articles.
o JP Morgan's news was released on its investor day and has nothing in common with any old news.
14. GOAL
o Topic modeling vs. keyword search
  o Isolate news about a particular asset class, say Oil.
o Regression
  o Study the impact of news relating to "Oil" on the Crude Oil Index.
16.
o Documents are mixtures of topics.
o Topics are probability distributions over words.
o Words can have high probability in multiple topics.
o By observing the words present in documents (the evidence), we infer the posterior probability distribution of words in topics, starting from a prior: Bayesian inference.
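The Bayesian step can be illustrated on a single word. The sketch below assumes the topic-word distributions and the document's topic mixture are already known (all numbers are invented) and applies Bayes' rule to get the probability of each topic given an observed word. A full topic model would also have to infer those distributions from the corpus, which this example does not attempt.

```python
# Invented example: two topics, each a probability distribution over words.
word_given_topic = {
    "oil": {"crude": 0.5, "price": 0.3, "barrel": 0.2},
    "tech": {"iphone": 0.6, "price": 0.3, "launch": 0.1},
}
# Invented topic mixture of one document (the prior over topics).
topic_prior = {"oil": 0.7, "tech": 0.3}


def topic_posterior(word, prior, likelihoods):
    """P(topic | word) is proportional to P(word | topic) * P(topic)."""
    joint = {t: prior[t] * likelihoods[t].get(word, 0.0) for t in prior}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError(f"word {word!r} has zero probability under every topic")
    return {t: value / total for t, value in joint.items()}


# "price" is likely under both topics, so the posterior stays at the prior.
print(topic_posterior("price", topic_prior, word_given_topic))
# "crude" only occurs in the oil topic, so it settles the question.
print(topic_posterior("crude", topic_prior, word_given_topic))
```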
21. MEASURING NEWS BY IMPACT
• Oil chosen as the asset class
• The tokenized news dictionary contains the candidate variables; however, this list is vast
• SPCA (sparse principal component analysis) performed to identify keywords associated with Oil
• Extracted the news headlines containing those keywords and re-tokenized them
• The reduced set of candidate variables serves as the regressors for returns
• We need a sparse solution for β:
  o Ridge regression
  o Iterative Hard Thresholding (IHT)
Data representation for regression analysis
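One possible data representation is a headline-by-word count matrix. The sketch below is hypothetical throughout: the keyword list stands in for the SPCA output, the headlines are invented, and the simple lowercase tokenizer is an assumption. It filters headlines by keyword, re-tokenizes them, and builds the matrix X whose columns would be the regressors.

```python
import re

# Hypothetical keywords, standing in for the output of the SPCA step.
keywords = {"oil", "crude", "opec"}

# Invented headlines.
headlines = [
    "Oil prices slide as OPEC keeps output steady",
    "Apple unveils new iPhone",
    "Crude oil rallies on supply fears",
]


def tokenize(text):
    """Lowercase and split into alphabetic tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


# Keep only headlines that mention at least one keyword.
filtered = [h for h in headlines if keywords & set(tokenize(h))]

# Re-tokenize the retained headlines to get the reduced vocabulary.
vocab = sorted({token for h in filtered for token in tokenize(h)})

# X[i][j] = number of times vocab[j] occurs in filtered[i].
X = [[tokenize(h).count(word) for word in vocab] for h in filtered]

print(vocab)
for row in X:
    print(row)
```

Each row of X would then be paired with the return observed around that headline's time stamp to form the response vector y.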
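22. RIDGE REGRESSION AND IHT OVERVIEW
o Ridge regression is, in principle, similar to OLS, but imposes a penalty on the L2 norm of the parameter β
o The problem to solve is then:

$$\hat{\beta}_\lambda = \arg\min_{\beta} \; \|y - X\beta\|^2 + \lambda \|\beta\|_2^2 \quad \text{s.t. } \lambda \ge 0$$

o where λ is the complexity parameter
o The closed-form solution is:

$$\hat{\beta}_\lambda = (\Sigma + \lambda I)^{-1} \, \frac{1}{n} X^\top y$$

  taking Σ = (1/n) XᵀX, with n the number of observations; the 1/n factors correspond to scaling the squared-error term of the objective by 1/n (equivalently, replacing λ by nλ in the objective)
o Setting a higher value for λ shrinks β more strongly toward zero; the coefficients become small but are generally not exactly zero, so ridge alone does not produce a strictly sparse solution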
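The closed-form ridge solution can be checked numerically. The NumPy sketch below uses synthetic data (not the project's headlines or returns) and assumes the normalization Σ = XᵀX/n. It solves the linear system instead of forming an explicit inverse, and compares a small and a large λ.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Solve (X^T X / n + lam * I) beta = X^T y / n for beta."""
    n, p = X.shape
    sigma = X.T @ X / n
    return np.linalg.solve(sigma + lam * np.eye(p), X.T @ y / n)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_beta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=50)

beta_small = ridge_closed_form(X, y, lam=0.01)
beta_large = ridge_closed_form(X, y, lam=10.0)

print("lambda=0.01:", np.round(beta_small, 3))
print("lambda=10  :", np.round(beta_large, 3))
print("norms:", np.linalg.norm(beta_small), np.linalg.norm(beta_large))
```

The larger λ gives a coefficient vector with a smaller norm, but its entries stay nonzero, which is why a cardinality constraint is considered separately.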
o Another way to obtain sparsity in the solution is to limit its cardinality while solving the minimization problem.
o IHT limits the cardinality by introducing an additional constraint. For a least-squares loss function, the IHT problem is:

$$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|^2 \quad \text{s.t. } \operatorname{card}(\operatorname{supp}(\beta)) \le K$$

o The cardinality constraint does not lead to a closed-form solution, so the problem must be solved iteratively
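A common way to solve the cardinality-constrained problem iteratively is to alternate a gradient step on the least-squares loss with a projection that keeps only the K largest-magnitude coefficients. The NumPy sketch below follows that textbook scheme on synthetic data. The step size, iteration count and starting point are assumptions, and the slides do not state which variant the project used.

```python
import numpy as np


def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero out the rest."""
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argsort(np.abs(beta))[-k:]
        out[keep] = beta[keep]
    return out


def iht(X, y, k, n_iter=200):
    """Iterative hard thresholding for min ||y - X beta||^2, card(supp(beta)) <= k."""
    # Step 1/L, with L the squared spectral norm of X (the constant factor 2
    # of the gradient is absorbed into the step size).
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient_step = beta + step * X.T @ (y - X @ beta)
        beta = hard_threshold(gradient_step, k)
    return beta


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_beta = np.zeros(20)
true_beta[[2, 7, 11]] = [3.0, -2.0, 1.5]
y = X @ true_beta

K = 3
beta_hat = iht(X, y, K)
print("support:", np.flatnonzero(beta_hat))
print("loss:", np.sum((y - X @ beta_hat) ** 2))
```

Unlike ridge, every coefficient outside the selected support is exactly zero, so the surviving columns can be read off directly as the chosen words.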
23. RESULTS: RIDGE REGRESSION
• With the log returns as the dependent variable, the results of the ridge regression are given in the table on the right
• We can check the efficacy of the results on the validation dataset: do positive words identify positive returns, and vice versa?
• From the charts below, the words identify true positives and true negatives reasonably well
Ridge regression: Positive and negative words
Ridge regression: Positive words
Ridge regression: Negative words
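The validation check described on this slide can be sketched as a sign-agreement count. Everything here is hypothetical: the word weights, headlines and returns are invented, and summing the weights of the words in a headline is just one simple way to turn word coefficients into a predicted direction.

```python
# Invented word weights (positive words > 0, negative words < 0).
word_weights = {"rally": 0.8, "surge": 0.5, "glut": -0.7, "slump": -0.6}

# Invented validation set: (headline, realized log return).
validation = [
    ("Oil prices rally on supply cut", 0.012),
    ("Crude slump deepens amid glut", -0.020),
    ("Oil surge fades", -0.004),
    ("Traders await inventory data", 0.001),
]


def predict_sign(headline, weights):
    """Return +1, -1 or 0 from the summed weights of the words in the headline."""
    score = sum(weights.get(token, 0.0) for token in headline.lower().split())
    return (score > 0) - (score < 0)


def confusion_counts(samples, weights):
    """Count true/false positives and negatives; headlines with no signal are skipped."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0, "skipped": 0}
    for headline, realized in samples:
        predicted = predict_sign(headline, weights)
        if predicted == 0:
            counts["skipped"] += 1
        elif predicted > 0:
            counts["TP" if realized > 0 else "FP"] += 1
        else:
            counts["TN" if realized < 0 else "FN"] += 1
    return counts


counts = confusion_counts(validation, word_weights)
print(counts)
```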
24. RESULTS: IHT
Following the same procedure, but evaluating positive and negative returns separately, IHT yields the lists of positive and negative words shown on the right.
Iterative Hard Thresholding: Positive words
Iterative Hard Thresholding: Negative words
News headlines corresponding to "advantage soros":
WILLISTON, N.D. (Reuters) - Hedge fund Paulson & Co has boosted its stake in Whiting Petroleum Corp to become the No. 1 shareholder in North Dakota's largest oil producer, taking advantage of...
NEW YORK (Reuters) - Soros Fund Management LLC took new positions in the energy sector in the fourth quarter, including stakes in Devon Energy Corp and Transocean Ltd, a regulatory filing showed...
News headline corresponding to "adgas":
DUBAI, Feb 10 (Reuters) - State-run Abu Dhabi Gas Industries Co (GASCO) and Abu Dhabi Gas Liquefaction Co (ADGAS) said on Tuesday they had awarded about $1.6 billion worth of contracts to expand the country's natural gas processing facilities.
25. SUMMARY AND CONCLUSION
• In our project, we have used machine learning tools to help a trader better understand news and extract information relevant to his portfolio. The work focused on developing analysis around:
  o Whether the news is fresh or stale
  o Identifying news that has high impact on an asset
Limitations and further work:
• Data for analysis limited by RSS feeds
• We observe that the words by themselves are not very insightful
• Analyze covariance structure
26. ACKNOWLEDGEMENTS
• We sincerely thank Professor Laurent El Ghaoui for his time and patience.
• Gratitude is due to Jeff Huang, Andrew Godbehere and Steven Yadlowsky.
• Thanks to Eric and Matt.
27. APPENDIX
o West Texas Intermediate (WTI), also known as Texas light sweet, is
a grade of crude oil used as a benchmark in oil pricing. This grade
is described as light because of its relatively low density, and sweet
because of its low sulfur content.