Presented at QuantCon Singapore 2016, Quantopian's quantitative finance and algorithmic trading conference, November 11th.
The vast proliferation of data related to the financial industry introduces both new opportunities and new challenges for quantitative investors. These challenges often stem from the nature of big data itself: volume, variety, and velocity.
In this talk, Dr. Cheng will take the audience on a tour of the “big-data production line” at InfoTrie and show how financial news collected from various customizable sources is transformed into quantitative signals in real time. The talk will touch on topics such as sentiment analysis, entity detection, topic classification, and big-data tools.
“Real Time Machine Learning Architecture and Sentiment Analysis Applied to Finance” by Juan Cheng, Data Scientist at InfoTrie
1. Real Time Machine Learning Architecture &
Sentiment Analysis
Quantcon 2016, Singapore
Juan CHENG, PHD
Data Scientist
cheng.juan@infotrie.com
www.infotrie.com
@infotrie
www.finsents.com
@finsents
2. ● About us
● News analytics in finance
● A news analytics case
○ Information extraction of text
○ Text feature extraction for machine learning classification
○ Big data tools applied
○ Architecture that combines all
4. FinSentS.com
➔ Real-time information
and trading portal
➔ Millions of sources /
Multilingual
➔ SaaS or on-premises
➔ Real-time Alerts
➔ Actionable signals
Sentiment Data
➔ Through API or third parties
➔ Up to 15 years of history
➔ Low latency / Tick by tick
➔ 50,000+ entities
➔ Stock, Forex, commodities,
index, Macroeconomic topics
etc…
Consultancy and Training
➔ Trading Technology
➔ Algorithmic trading
➔ Big Data
➔ Natural Language
Processing (NLP)
➔ Machine Learning
5. B.
No, I’m a quant. I
find it hard to
quantify news.
A.
No, I find news
noisy. There is just
too much of it.
C.
Yes. But I find using
news is not very efficient.
I have to manually
relate it to my
portfolio.
6. Access to News / News
management
- Visualization tools
- Filtering tools
- On demand view
Feed from multiple sources:
- Social Media
- Web based content
- Private sources
- Internal data
News Content Alerts
based on sentiment
indicator
Provide accurate
information from the Big
Data environment and
push it in front of users
in real time for risk
management
Dashboard
- Consolidated
Dashboard
- Portfolio Alerts
Actionable indicators
Users receive news
signals for trading /
hedging / risk
management based on the
sentiment indicator
Algo Trading / Robo Trading
Real Time algorithmic trading
Sentiment indicator and News
Analytics
Equity Research / Sales Team Hedging Trader / Prop Trader
- News Tag Cloud
- Filtering newsfeed with
Social media blotter, news
blotter
- Search Engine on demand
- Topics detection
- Rumours alerts
- News qualification per
importance
- Relevant information
from single screen
- Automatic Alert
- Integrated to OMS
Provide relevant news
analytics indicator for
hedging or trade idea
generation
News analytics signals
fully integrated into
algo trading strategies
7. Reuters
MARKET NEWS | Fri Oct 21, 2016 | 2:18am EDT
AT&T acquires Time Warner for $85 billion
NEW YORK- AT&T Inc said it agreed to buy Time Warner Inc for $85.4 billion,
the boldest move yet by a telecommunications company to acquire content to
stream over its high-speed network to attract a growing number of online
viewers.
The trend of consolidation comes as technology advances have been upending
traditional entertainment companies. Many in the industry believe that getting
bigger is the best way to compete with companies like Google, Apple, Netflix and
Facebook.
David Goldman and Paul R. La Monica contributed to this report.
8. Reuters (the same AT&T / Time Warner article as on the previous slide), annotated with:
Source
Category
Time
Location
Named Entity
Sentiment
Event
Hacking skills, regex, NLP, named entity recognition, POS taggers
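The regex part of this extraction step can be sketched on the Reuters example. This is only a minimal illustration, not InfoTrie's pipeline: the function name `extract_fields` and all patterns are assumptions, and the "company" pattern (capitalised words followed by "Inc") is a naive stand-in for real named entity recognition.

```python
import re

# Header, headline and first sentence of the Reuters example.
article = (
    "Reuters\n"
    "MARKET NEWS | Fri Oct 21, 2016 | 2:18am EDT\n"
    "AT&T acquires Time Warner for $85 billion\n"
    "NEW YORK- AT&T Inc said it agreed to buy Time Warner Inc for $85.4 billion"
)

def extract_fields(text):
    """Hypothetical helper: pull source, category, time, location, entities."""
    lines = text.split("\n")
    fields = {"source": lines[0].strip()}

    # Second line has the shape "CATEGORY | date | time".
    header = re.match(r"(?P<category>[^|]+)\|(?P<date>[^|]+)\|(?P<time>.+)", lines[1])
    if header:
        fields["category"] = header.group("category").strip()
        fields["time"] = header.group("date").strip() + " " + header.group("time").strip()

    # Dateline: upper-case location in front of the first hyphen.
    dateline = re.match(r"([A-Z][A-Z ]+?)\s*-", lines[3])
    if dateline:
        fields["location"] = dateline.group(1)

    # Naive entity pattern: capitalised tokens followed by "Inc".
    fields["entities"] = sorted(set(re.findall(r"((?:[A-Z][\w&]*\s)+Inc)\b", text)))

    # Dollar amounts such as "$85.4 billion".
    fields["amounts"] = re.findall(r"\$\d+(?:\.\d+)?\s(?:million|billion)", text)
    return fields

fields = extract_fields(article)
for name, value in fields.items():
    print(name, "->", value)
```

Sentiment and event detection are not covered here; they need the feature extraction and classification steps on the following slides.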
9. Train Document Set:
d1: The sky is blue.
d2: The sun is bright.
Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the
bright sun.
Vector Space Model (VSM)
t1 t2...
d1
d2 ...
10. Train Document Set:
d1: The sky is blue.
d2: The sun is bright.
Vocabulary
Term frequency(TF)
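A minimal sketch of the vocabulary and term-frequency step, using the four example documents. The stop-word list is an assumption chosen for this toy corpus, and `tokenize` and `tf_vector` are illustrative names.

```python
import re
from collections import Counter

# Assumed stop-word list, just large enough for the example sentences.
STOP_WORDS = {"the", "is", "in", "we", "can", "see"}

def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

train = {"d1": "The sky is blue.", "d2": "The sun is bright."}
test = {
    "d3": "The sun in the sky is bright.",
    "d4": "We can see the shining sun, the bright sun.",
}

# The vocabulary is built from the training documents only.
vocabulary = sorted({t for doc in train.values() for t in tokenize(doc)})

def tf_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in vocabulary]

print(vocabulary)
for name, doc in test.items():
    print(name, tf_vector(doc))
```

Terms that are not in the training vocabulary (here "shining") are simply ignored when a test document is vectorised.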
11. TF emphasizes terms that are present in almost the entire corpus
TF-IDF
TF example IDF example
Normalized TF-IDF
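The slide formulas are not preserved in this transcript, so the following sketch assumes one common smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector. Stop words are deliberately kept so that the effect on a term like "the", which occurs in every document, is visible.

```python
import math
import re
from collections import Counter

docs = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_matrix(corpus):
    """Return (vocabulary, idf weights, L2-normalised TF-IDF rows)."""
    tokenized = [tokenize(d) for d in corpus]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Assumed smoothed IDF variant (see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, idf, rows

vocab, idf, rows = tfidf_matrix(docs)
d1 = dict(zip(vocab, rows[0]))
print({t: round(w, 3) for t, w in d1.items() if w > 0})
```

In d1 every term has a raw count of 1, but after weighting "blue" (one document) outranks "sky" (two documents), which outranks "the" (all four documents).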
12. Train and test document sets (d1 to d4, as above)
Vector Space Model (VSM)
Machine Learning
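Once documents are vectors, standard machine learning methods can operate on them. As a hedged illustration only (the talk does not say which learner is used), the sketch below matches each test document to its nearest training document by cosine similarity. The stop-word list and function names are assumptions.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "in", "we", "can", "see"}  # assumed list

def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

train = {"d1": "The sky is blue.", "d2": "The sun is bright."}
test = {
    "d3": "The sun in the sky is bright.",
    "d4": "We can see the shining sun, the bright sun.",
}

vocabulary = sorted({t for doc in train.values() for t in tokenize(doc)})

def vectorize(text):
    counts = Counter(tokenize(text))
    return [counts[t] for t in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_train_doc(text):
    """Name of the training document most similar to the given text."""
    query = vectorize(text)
    return max(train, key=lambda name: cosine(query, vectorize(train[name])))

for name, doc in test.items():
    print(name, "is closest to", nearest_train_doc(doc))
```

With labelled training documents, the same similarity would give a simple nearest-neighbour classifier.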
13. - Companies, indexes
- People, locations, organizations
- Events
- Regions
NLP
Text
- Dow Jones, Bloomberg
- Web news, blogs, Twitter
- 1000+ sources
Feature Extraction
Classification
Sentiment
- 15 years history
- Tens of millions of articles
Training
Indexing
- Sector/industry
- Commodity, FX, ETFs
- Political, country risk
- Macroeconomic
- Fear, greed, anger,
happiness
Aggregation
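The final aggregation step could look roughly like this: article-level sentiment scores are rolled up per entity and per day. The records, the score range of -1 to 1, and the function name `aggregate_sentiment` are all hypothetical; the talk does not specify how scores are aggregated.

```python
from collections import defaultdict

# Hypothetical article-level records: (entity, date, sentiment score in [-1, 1]).
scored_articles = [
    ("AT&T", "2016-10-21", 0.6),
    ("AT&T", "2016-10-21", 0.2),
    ("Time Warner", "2016-10-21", 0.8),
    ("AT&T", "2016-10-22", -0.4),
]

def aggregate_sentiment(records):
    """Group scores by (entity, day) and report mean score and article count."""
    buckets = defaultdict(list)
    for entity, day, score in records:
        buckets[(entity, day)].append(score)
    return {
        key: {"mean": sum(scores) / len(scores), "volume": len(scores)}
        for key, scores in buckets.items()
    }

daily = aggregate_sentiment(scored_articles)
for key, stats in sorted(daily.items()):
    print(key, stats)
```

Keeping the article count next to the mean lets a consumer tell a signal backed by many articles from one backed by a single story.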
14. ❏ Guaranteed data processing
❏ Horizontal scalability
❏ Fault-tolerance
❏ Higher level abstraction than message passing
❏ Real-time machine learning for classification and predictive
analytics
16. Fast and general engine for large-scale distributed data processing
Memory Network CPUs Disk
Reference: Spark
Logistic regression in Hadoop and Spark
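Logistic regression is a typical iterative workload: every iteration makes a full pass over the same data set. The sketch below shows that loop in plain Python, with the per-point gradient written as a "map" and the sum as a "reduce". It is not Spark or Hadoop code, and the toy data, learning rate, and iteration count are assumptions.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def point_gradient(w, x, y):
    """'Map' step: gradient contribution of a single labelled point."""
    error = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
    return [error * xi for xi in x]

def add_vectors(a, b):
    """'Reduce' step: element-wise sum of two gradient vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def train_logistic(points, labels, iterations=100, learning_rate=0.5):
    w = [0.0] * len(points[0])
    for _ in range(iterations):
        # Every iteration re-reads the whole data set.
        gradients = [point_gradient(w, x, y) for x, y in zip(points, labels)]
        total = reduce(add_vectors, gradients)
        w = [wi - learning_rate * g / len(points) for wi, g in zip(w, total)]
    return w

# Toy data (assumed): first feature is a constant bias term.
points = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]

w = train_logistic(points, labels)
predictions = [int(sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5) for x in points]
print(w, predictions)
```

Because the same points are read on every pass, keeping them in memory between iterations avoids repeated disk reads.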
17. Open source distributed real-time computation system that makes it easy to process unbounded streams of data
Storm was benchmarked at
processing one million 100
byte messages per second
per node on hardware with the
following specs:
● Processor: 2x Intel
E5645 @ 2.4GHz
● Memory: 24 GB
Reference: Storm
Spout
Bolt
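The spout and bolt roles can be illustrated with plain Python generators: a spout emits tuples into the stream, and each bolt consumes tuples and emits transformed ones. This is a single-process analogy only. It does not use the Storm API and provides none of Storm's parallelism or processing guarantees. The function names and the two small word lists are assumptions.

```python
def headline_spout(headlines):
    """Spout: source of the stream, emits one tuple per headline."""
    for headline in headlines:
        yield {"headline": headline}

def tokenize_bolt(stream):
    """Bolt: adds a lower-cased token list to each tuple."""
    for tup in stream:
        yield {**tup, "tokens": tup["headline"].lower().split()}

def sentiment_bolt(stream, positive, negative):
    """Bolt: scores each tuple as (#positive words) - (#negative words)."""
    for tup in stream:
        score = sum(t in positive for t in tup["tokens"]) - sum(
            t in negative for t in tup["tokens"]
        )
        yield {**tup, "score": score}

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"acquires", "gains"}
NEGATIVE = {"strike", "falls"}

headlines = [
    "AT&T acquires Time Warner for $85 billion",
    "A union calls for a strike in a factory in Argentina",
]

# Wire the "topology": spout -> tokenize bolt -> sentiment bolt.
topology = sentiment_bolt(tokenize_bolt(headline_spout(headlines)), POSITIVE, NEGATIVE)
results = list(topology)
for tup in results:
    print(tup["score"], tup["headline"])
```

Because generators are lazy, each headline flows through both bolts before the next one is read, which mirrors tuple-at-a-time stream processing.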
18. ✓ Guaranteed data processing
✓ Horizontal scalability
✓ Fault-tolerance
✓ Higher level abstraction than message
passing
✓ Real-time machine learning for
classification and predictive analytics
22. Sentiment in itself is a powerful trading indicator from which
multiple trading strategies can be built
Simulate impact of
complex events
23. MiFID alert
Improve client communication
Regulatory
Process complex / low-signal
events
ESG monitoring
Ecological – Social –
Governance
A union calls for
a strike in a
factory in
Argentina?
Negative news coverage of a stock I
hold is accelerating in the
Chinese press but has not yet
reached the English press?
A European company
employs children in
Bangladesh (*)?
ACTIONS