TEXT ANALYTICS OF NEWS FOR THE
TRADING FLOOR
Group 19:
Debarshi Basu, Siddhartha Gupta, Yiyang Fan
Under the guidance of
CONTENT
• Motivation
• Data Collection from RSS feeds
• Torturing Data, a.k.a. Data Mining Methodologies
• Identifying freshness
• Identifying impact
• Some Awesome Results
• Conclusion
• Acknowledgements, Q&A
OBJECTIVE: PROVIDE INTERFACE TO RELEVANT NEWS
o News moves prices!
o Traders need fast access to relevant news:
  o News regarding the asset of interest
  o Fresh news
  o News that impacts prices the most
MOTIVATION: CURRENT INEFFICIENCY
o Existing sources only allow keyword search
  o E.g. a search for AAPL misses news on new iPhone specs
o Existing keyword search doesn’t differentiate between kinds of news
  o E.g. a keyword search for AAPL doesn’t distinguish Apple’s quarterly earnings release from the launch of new products
[Figure: example search results, "First 3 news" and "Next 3 news"]
DATA SOURCE: RSS FEEDS
• Archived historical data from Bloomberg was unavailable
• Collected Rich Site Summary (RSS) feeds instead
  • A standard for communicating information updates to subscribers
  • An XML-based format, compatible with multiple platforms
• RSS feed sources:
  o CNBC Top News     o Reuters Business
  o CNBC Business     o Reuters Company News
  o CNBC Economy      o Financial Times Market
  o CNBC Finance      o Financial Times US Market
  o Bloomberg         o WSJ Business
  o Reuters Money     o WSJ Markets
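Because RSS is plain XML, headlines and publication times can be pulled out with a standard XML parser. The sketch below is only an illustration of that idea, not the project's collector: the feed text, the titles, and the helper name `parse_items` are made up, and it assumes an RSS 2.0 layout with `<item>`, `<title>` and `<pubDate>` elements.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 snippet (made-up feed and titles, for illustration only).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Business Feed</title>
    <item>
      <title>Example headline about oil prices</title>
      <pubDate>Tue, 17 Feb 2015 20:14:00 GMT</pubDate>
    </item>
    <item>
      <title>Example headline about credit card debt</title>
      <pubDate>Tue, 17 Feb 2015 20:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Yield (unix_timestamp, headline) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        pub_date = item.findtext("pubDate")
        # pubDate uses the RFC 822 date format; convert it to epoch seconds.
        timestamp = int(parsedate_to_datetime(pub_date).timestamp()) if pub_date else None
        yield timestamp, title


rows = list(parse_items(SAMPLE_RSS))
for timestamp, title in rows:
    print(timestamp, title)
```

Storing the time as an integer keeps the parsed rows directly comparable across feeds that report different time zones.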
DATA: COLLECTION
• Time Horizon:
  • January 27th, 2015 to March 4th, 2015
• Sample Size:
  • 7,127 news headlines (up to March 4th, 2015)
• SQLite database:

Time Stamp [INTEGER]   Headline [TEXT]
……                     ……
1424204040             With fixed-income yields at record lows, a senior broker has told CNBC that now is the perfect time for investors to sell and move into equities.
1424205000             Here’s how to stop overspending, undersaving and racking up credit card debt.
1424205180             Here’s what will happen to the market and individual stocks when underperforming hedge funds are forced to chase this rally.
……                     ……
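A two-column table like the one above can be sketched with Python's built-in `sqlite3` module. The table and column names (`headlines`, `time_stamp`, `headline`) are hypothetical, and the sketch assumes the integer time stamps are Unix epoch seconds, which the slides do not state explicitly.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database for illustration; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE headlines (time_stamp INTEGER, headline TEXT)")

# Sample rows copied from the slide.
sample_rows = [
    (1424204040, "With fixed-income yields at record lows, a senior broker has told "
                 "CNBC that now is the perfect time for investors to sell and move into equities."),
    (1424205000, "Here’s how to stop overspending, undersaving and racking up credit card debt."),
    (1424205180, "Here’s what will happen to the market and individual stocks when "
                 "underperforming hedge funds are forced to chase this rally."),
]
conn.executemany("INSERT INTO headlines VALUES (?, ?)", sample_rows)
conn.commit()

# Read the rows back in time order, interpreting the stamp as epoch seconds (UTC).
query = "SELECT time_stamp, headline FROM headlines ORDER BY time_stamp"
for time_stamp, headline in conn.execute(query):
    when = datetime.fromtimestamp(time_stamp, tz=timezone.utc)
    print(when.isoformat(), headline[:60])
```

Under the epoch-seconds assumption, the three sample stamps fall on 17 February 2015, which is inside the stated collection window.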
DATA: SUMMARY
• Distribution of news grouped by days and dates
DATA: TOKENIZATION
BINARY CLASSIFICATION OF NEWS
Fresh News vs. Stale News
o Definition: a fresh news item is one that contains information not contained in any previous news item.
o Classification can be done using a Support Vector Machine (SVM)
o Supervised learning: the data are assigned labels
  o News that arrived in the last 2 days: label +1
  o News that arrived earlier: label -1
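The labelling rule can be sketched as a simple cut-off on arrival time. This is a minimal illustration under assumptions the slides do not spell out: time stamps are Unix epoch seconds, and "the last 2 days" is measured back from a reference time `as_of` (a hypothetical parameter).

```python
from datetime import datetime, timedelta, timezone

FRESH_WINDOW = timedelta(days=2)  # "last 2 days" from the slide


def label_by_age(timestamps, as_of):
    """Return +1 for headlines that arrived within FRESH_WINDOW of as_of, else -1."""
    cutoff = as_of - FRESH_WINDOW
    labels = []
    for ts in timestamps:
        arrived = datetime.fromtimestamp(ts, tz=timezone.utc)
        labels.append(1 if arrived >= cutoff else -1)
    return labels


# Example: one headline from 17 Feb 2015 and one that arrived three days earlier.
reference = datetime(2015, 2, 18, tzinfo=timezone.utc)
print(label_by_age([1424204040, 1424204040 - 3 * 86400], reference))
```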
o Used an SVM to classify headlines according to their assigned label (new = +1, old = -1)
o The SVM maximizes the margin, i.e. the distance between the two parallel hyperplanes that bound the fresh and the stale headlines.
o Training done on 75% of the data, validation on 25%
o Used a Gaussian (Radial Basis Function) kernel.
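A setup of this kind might look roughly as follows in scikit-learn. This is a sketch, not the project's code: the toy headlines and their labels are made up, and the TF-IDF features, `gamma="scale"` and the fixed random seed are assumptions, since the slides only state the 75/25 split and the RBF kernel.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Made-up toy headlines, for illustration only: label +1 = fresh, -1 = stale.
fresh = [
    "oil jumps as producers announce surprise output cut",
    "central bank unexpectedly lowers benchmark rate today",
    "tech giant unveils new phone at launch event",
    "bank reveals fresh targets at investor day",
    "refinery strike begins as talks collapse overnight",
    "retailer announces surprise takeover bid this morning",
    "regulator opens new probe into currency trading",
    "airline reports first profit after fuel costs drop",
]
stale = [
    "oil extends losses as supply glut persists",
    "analysts repeat concerns over weak oil demand",
    "markets still waiting on central bank decision",
    "investors continue to weigh earlier earnings reports",
    "debt talks drag on with no progress again",
    "oil glut persists as inventories keep rising",
    "traders remain cautious after last week selloff",
    "earnings season continues with few surprises so far",
]
headlines = fresh + stale
labels = [1] * len(fresh) + [-1] * len(stale)

# 75% training, 25% validation, as on the slide.
train_text, test_text, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.25, random_state=0, stratify=labels
)

# Assumed feature representation: TF-IDF weights fitted on the training split only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

# SVM with a Gaussian (RBF) kernel.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
scores = clf.decision_function(X_test)  # signed distances, usable for ROC/AUC
print(list(zip(test_text, predictions)))
```

Fitting the vectorizer on the training split alone keeps vocabulary from the validation headlines out of the model.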
Area Under the Curve (AUC)
o 70-80% over time
o Well above the 50% expected from random guessing, indicating good performance
Plot based on:
o 4150 headlines over 20 days
o 1037 out-of-sample headlines
o Written in Python using scikit-learn
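For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen fresh headline receives a higher classifier score than a randomly chosen stale one. The helper below (a hypothetical name, written in plain Python rather than with the scikit-learn routine the project presumably used) computes it by direct pairwise comparison.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels use +1 for fresh and -1 for stale; ties count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == -1]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores: three of the four fresh/stale pairs are ordered correctly.
print(roc_auc([1, -1, 1, -1], [0.9, 0.8, 0.3, 0.1]))
```

The pairwise loop is quadratic in the number of headlines, which is fine for an illustration but slower than a rank-based implementation on thousands of items.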
News articles released on the same day
o News about oil contains words that had already been published in earlier news articles.
o JP Morgan’s news was released on its investor day and has nothing in common with any older news.
GOAL
o Topic modeling vs. keyword search
  o Isolate news about a particular asset class, say Oil.
o Regression
  o Study the impact of news relating to “Oil” on the Crude Oil Index.
o Documents are mixtures of topics.
o Topics are probability distributions over words.
o Words can have high probability in multiple topics.
o By observing which words are present in the documents (the evidence), we infer the probability distribution of words in each topic (the posterior): Bayesian inference
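The mixture view in the bullets above can be written compactly, as in latent Dirichlet allocation (LDA). The symbols are assumptions introduced for illustration, since the slides define none: θ_{d,k} is the weight of topic k in document d, φ_{k,w} is the probability of word w under topic k, and K is the number of topics.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \phi_{k,w} = 1,
\\[6pt]
p(\theta, \phi \mid \text{words}) \;\propto\; p(\text{words} \mid \theta, \phi)\; p(\theta, \phi).
```

The second line is Bayes' rule: the prior over topic weights and word distributions is updated by the likelihood of the observed words to give the posterior.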
LDA
SPCA
MEASURING NEWS BY IMPACT
• Oil chosen as the asset class
• The tokenized news dictionary contains the candidate variables; however, this list is vast
• SPCA performed to identify keywords associated with Oil
• Extracted the news headlines containing those keywords and re-tokenized them
• The reduced set of candidate variables serves as regressors for returns
• We need a sparse solution for β
  • Ridge regression
  • Iterative Hard Thresholding (IHT)
[Figure: Data representation for regression analysis]
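One common way to turn tokenized headlines into regressors is a document-term matrix with one column per dictionary word. The original figure is not recoverable, so the sketch below is an assumption about the layout: it uses binary presence indicators, a simple lowercase letter-only tokenizer, and made-up headlines.

```python
import re


def tokenize(headline):
    """Lowercase the headline and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", headline.lower())


def build_design_matrix(headlines):
    """Return (vocabulary, rows), where rows[i][j] is 1 if word j occurs in headline i."""
    vocabulary = sorted({token for h in headlines for token in tokenize(h)})
    column = {token: j for j, token in enumerate(vocabulary)}
    rows = []
    for h in headlines:
        row = [0] * len(vocabulary)
        for token in tokenize(h):
            row[column[token]] = 1
        rows.append(row)
    return vocabulary, rows


# Made-up headlines, for illustration only.
vocabulary, X = build_design_matrix(["Oil prices rise", "Oil supply falls"])
print(vocabulary)
for row in X:
    print(row)
```

Each row of `X` would then be paired with the return observed for the corresponding period to form the regression data.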
RIDGE REGRESSION AND IHT OVERVIEW
Ridge Regression
◦ Ridge regression is in principle similar to OLS, but imposes a penalty on the L2 norm of the parameter β
◦ The problem to solve is then:
$$\beta_\lambda = \arg\min_{\beta} \; \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \quad \text{s.t. } \lambda \ge 0$$
◦ where λ is the complexity parameter
◦ The closed-form solution is:
$$\beta_\lambda = (\Sigma + \lambda I)^{-1} \, \frac{1}{n} X^T Y$$
  taking $\Sigma = \frac{1}{n} X^T X$, and with the squared-error term scaled by 1/n (equivalently, with λ replaced by nλ in the objective above)
◦ Setting a higher value for λ shrinks β more strongly toward zero, although ridge rarely sets coefficients exactly to zero
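The ridge closed form is easy to check numerically. The sketch below uses NumPy on synthetic data (the data, seed and function names are made up) and writes the estimator two ways: via the normal equations, and in the covariance form from the slide, assuming Σ denotes XᵀX/n, which the slide does not define.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Minimise ||y - X b||^2 + lam * ||b||^2 by solving the normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def ridge_covariance_form(X, y, lam):
    """Same estimator written as (Sigma + lam' I)^{-1} (1/n) X^T y with lam' = lam / n."""
    n, p = X.shape
    sigma = X.T @ X / n  # assumed meaning of Sigma on the slide
    return np.linalg.solve(sigma + (lam / n) * np.eye(p), X.T @ y / n)


# Synthetic regression problem with two truly irrelevant regressors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)

norms = {}
for lam in (0.0, 10.0, 1000.0):
    beta = ridge_closed_form(X, y, lam)
    norms[lam] = float(np.linalg.norm(beta))
    print(f"lambda={lam:g}  beta={np.round(beta, 3)}  nonzero={np.count_nonzero(beta)}")
```

The printed coefficients shrink as λ grows but stay nonzero, which is why a cardinality constraint or a thresholding step is needed when an explicitly sparse word list is wanted.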
Iterative Hard Thresholding (IHT)
◦ Another way to obtain sparsity in the solution is to limit its cardinality while solving the minimization problem.
◦ IHT limits the cardinality by introducing an additional constraint. The IHT problem for a least-squares loss function is:
$$\beta = \arg\min_{\beta} \; \|Y - X\beta\|_2^2 \quad \text{s.t. } \operatorname{card}(\operatorname{supp}(\beta)) \le K$$
◦ The cardinality constraint does not lead to a closed-form solution, so the problem needs to be solved iteratively
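A basic form of iterative hard thresholding alternates a gradient step on the least-squares loss with a projection that keeps only the K largest-magnitude coefficients. The sketch below is one standard variant, not necessarily the one used in the project: the step size 1/σ_max(X)², the iteration count, the zero initialisation and the synthetic data are all assumptions.

```python
import numpy as np


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    out = np.zeros_like(v)
    if k > 0:
        keep = np.argsort(np.abs(v))[-k:]
        out[keep] = v[keep]
    return out


def iht(X, y, k, n_iter=500):
    """Approximately minimise 0.5 * ||y - X b||^2 subject to card(supp(b)) <= k."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / (largest singular value)^2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (X @ beta - y)
        beta = hard_threshold(beta - step * gradient, k)
    return beta


# Synthetic problem with two relevant regressors out of ten.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
true_beta = np.zeros(10)
true_beta[[2, 7]] = [3.0, -2.0]
y = X @ true_beta + 0.05 * rng.normal(size=100)

beta_hat = iht(X, y, k=2)
print("support:", np.flatnonzero(beta_hat), "values:", np.round(beta_hat[beta_hat != 0], 3))
```

Unlike ridge, every iterate here has at most K nonzero coefficients by construction, so the surviving words can be read off directly.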
RESULTS: RIDGE REGRESSION
• With the log returns as the dependent variable, the results of the ridge regression are given in the table on the right
• On the validation dataset we can check the efficacy of the results, i.e. whether positive words identify positive returns and vice versa
• From the charts below, the words identify true positives and true negatives reasonably well
[Table: Ridge regression: positive and negative words]
[Charts: Ridge regression: positive words; Ridge regression: negative words]
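One way such a validation check could be set up is to score each validation headline by the sum of its words' fitted coefficients and compare the sign of the score with the sign of the realised return. The sketch below is an assumption about the procedure, not a reproduction of it; the coefficients, tokens and returns are made up.

```python
def headline_score(tokens, coefficients):
    """Sum the fitted coefficients of the words in a headline (unknown words count as 0)."""
    return sum(coefficients.get(token, 0.0) for token in tokens)


def hit_rates(scores, returns):
    """Return (true positive rate, true negative rate) of sign(score) versus sign(return)."""
    on_up_days = [s for s, r in zip(scores, returns) if r > 0]
    on_down_days = [s for s, r in zip(scores, returns) if r < 0]
    tpr = sum(s > 0 for s in on_up_days) / len(on_up_days) if on_up_days else float("nan")
    tnr = sum(s < 0 for s in on_down_days) / len(on_down_days) if on_down_days else float("nan")
    return tpr, tnr


# Hypothetical word coefficients and validation data, for illustration only.
coefficients = {"rally": 0.4, "cut": 0.2, "glut": -0.5, "slump": -0.3}
validation_tokens = [
    ["oil", "rally", "cut"],
    ["supply", "glut"],
    ["rally", "glut"],
    ["demand", "slump"],
]
validation_returns = [0.012, -0.020, 0.004, -0.008]

scores = [headline_score(tokens, coefficients) for tokens in validation_tokens]
tpr, tnr = hit_rates(scores, validation_returns)
print(f"true positive rate = {tpr:.2f}, true negative rate = {tnr:.2f}")
```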
RESULTS: IHT
Following the same procedure, but evaluating positive and negative returns separately, we get the lists on the right of positive and negative words
[Tables: Iterative Hard Thresholding: positive words; Iterative Hard Thresholding: negative words]
News headlines corresponding to “advantage soros”:
WILLISTON, N.D. (Reuters) - Hedge fund Paulson & Co has boosted its stake in Whiting Petroleum Corp to become the No. 1 shareholder in North Dakota’s largest oil producer, taking advantage of...
NEW YORK (Reuters) - Soros Fund Management LLC took new positions in the energy sector in the fourth quarter, including stakes in Devon Energy Corp and Transocean Ltd, a regulatory filing showed...
News headline corresponding to “adgas”:
DUBAI, Feb 10 (Reuters) - State-run Abu Dhabi Gas Industries Co (GASCO) and Abu Dhabi Gas Liquefaction Co (ADGAS) said on Tuesday they had awarded about $1.6 billion worth of contracts to expand the country’s natural gas processing facilities.
SUMMARY AND CONCLUSION
• In our project, we used machine learning tools to help a trader better understand news and extract information relevant to their portfolio. The work focused on developing analysis around:
  • Whether the news is fresh or stale
  • Identifying news that has a high impact on an asset
Limitations and further work:
• Data for analysis limited by RSS feeds
• We observe that the words by themselves are not very insightful
• Analyze covariance structure
ACKNOWLEDGEMENTS
• We sincerely thank Professor Laurent El Ghaoui for his time and patience.
• Gratitude is due to Jeff Huang, Andrew Godbehere and Steven Yadlowsky.
• Thanks to Eric and Matt.
APPENDIX
o West Texas Intermediate (WTI), also known as Texas light sweet, is
a grade of crude oil used as a benchmark in oil pricing. This grade
is described as light because of its relatively low density, and sweet
because of its low sulfur content.
