Dr. Chuan-Ju Wang gave a talk on textual data analytics in finance. The talk covered how natural language processing and text analytics can turn unstructured text, such as financial reports, into data for financial applications like risk prediction. Specifically, it described how sentiment analysis of financial reports with finance-specific lexicons can be used to predict stock return volatility and to rank companies by relative risk level. It also covered using continuous word embeddings to automatically expand financial lexicons with related keywords.
[2018 台灣人工智慧學校校友年會] Textual Data Analytics in Finance / 王釧茹
1. Talk @ Taiwan AI Academy, November 17, 2018
Textual Data Analytics in Finance
Dr. Chuan-Ju Wang (王釧茹)
Research Center for Information Technology Innovation, Academia Sinica
Computational Finance and Data Analytics Laboratory (CFDA Lab)
http://cfda.csie.org
2. Chuan-Ju Wang (CITI, AS) Talk @ Taiwan AI Academy November 17, 2018
Quant — Data Scientist
Source: http://www.indeed.com/jobtrends
Source: http://www.computerweekly.com/blogs/Data-Matters/2014/06/data-scientist-the-new-quant.html
3. Data Science in Finance
4. Text Analytics
❖ Big Data
❖ Structured Data
❖ user logs, sensor logs, click through logs, …
❖ Unstructured Data
❖ web texts, user conversations, public opinions, reports, …
❖ Big Data for Unstructured Text – Text Analytics
❖ Goal — Turn text into data for analysis, via application of
natural language processing (NLP) and analytical methods
https://insidebigdata.com/2015/06/05/text-analytics-the-next-generation-of-big-data/
5. Textual Sentiment Analysis for Financial Risk Prediction
On the Risk Prediction and Analysis of Soft Information in Finance Reports. European Journal of Operational Research (EJOR), 257(1), 243-250, 2017.
6. Soft and Hard Information in Finance
❖ The growing amount of financial data makes it increasingly important to learn how to discover valuable information for various financial applications.
❖ In finance, there are typically two kinds of information:
❖ Soft information: text, including opinions, ideas, and market
commentary.
❖ Hard information: numerical values, such as financial measures and
historical prices.
❖ Our work aims to exploit soft information for financial risk prediction.
7. Risk Proxy: Stock Return Volatility
❖ Stock return
$$R_t = \frac{S_t - S_{t-1}}{S_{t-1}}$$
❖ Stock return volatility
❖ A common risk metric measured by the standard deviation of returns over a period of time.
$$v_{[t-n,t]} = \sqrt{\frac{\sum_{i=t-n}^{t} (R_i - \bar{R})^2}{n}}, \quad \text{where } \bar{R} = \frac{\sum_{i=t-n}^{t} R_i}{n+1}.$$
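The return and volatility definitions on this slide can be illustrated with a minimal Python sketch. It assumes that S_t is a closing price, that the volatility is the square root of the summed squared deviations divided by n (consistent with "standard deviation of returns"), and that the window holds n + 1 returns. The function names and the price series are made up for illustration and do not come from the talk.

```python
import math


def simple_returns(prices):
    """R_t = (S_t - S_{t-1}) / S_{t-1} for each consecutive pair of prices."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def volatility(returns):
    """Standard deviation of n + 1 returns, with squared deviations divided by n."""
    n = len(returns) - 1
    if n < 1:
        raise ValueError("need at least two returns")
    mean = sum(returns) / (n + 1)
    return math.sqrt(sum((r - mean) ** 2 for r in returns) / n)


# Made-up closing prices, for illustration only.
prices = [100.0, 102.0, 101.0, 105.0]
rets = simple_returns(prices)
vol = volatility(rets)
print(rets, vol)
```

With n + 1 observations and a divisor of n, this is the ordinary sample standard deviation.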
8. Financial Sentiment Analysis
❖ In this work, we apply sentiment analysis to the risk prediction task.
❖ A finance-specific sentiment lexicon is adopted for analysis.
❖ Two machine learning techniques are adopted for the task:
❖ Regression approach: Predict the stock return volatilities.
❖ Ranking approach: Rank the companies in line with their relative risk levels.
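As a rough illustration of how a sentiment lexicon can feed both tasks, the sketch below turns a report into lexicon-word features and derives a ranking target from volatilities. Everything here is an assumption made for illustration: the five-word lexicon, the log(1 + count) weighting, and the function names are hypothetical and are not taken from the talk or the underlying paper.

```python
import math
import re
from collections import Counter

# Hypothetical mini-lexicon; a real one would be a finance-specific word list.
LEXICON = {"default", "deficit", "delist", "profit", "doubt"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def lexicon_features(text, lexicon=LEXICON):
    """Map each lexicon word to log(1 + count) of its occurrences in the text."""
    counts = Counter(tok for tok in tokenize(text) if tok in lexicon)
    return {word: math.log1p(counts[word]) for word in sorted(lexicon)}


def risk_ranking(volatility_by_company):
    """Order companies from lowest to highest volatility (a ranking target)."""
    return sorted(volatility_by_company, key=volatility_by_company.get)


feats = lexicon_features("The company may default. Default risk raises doubt.")
order = risk_ranking({"A": 0.3, "B": 0.1, "C": 0.2})
print(feats, order)
```

A regression model would be fit on such features against the volatility values themselves, while a ranking model would be fit against the ordering only.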
9. Financial Sentiment Lexicon
❖ Words often have different meanings in the finance domain than in general usage, such as
❖ vice: immoral or wicked behavior
❖ vice: secondary (in finance context)
❖ Almost three-fourths of the words in the 10-K financial reports from 1994 to 2008 that are identified as negative by the widely used Harvard Psychosociological Dictionary are typically not considered negative in financial contexts.
10. Six Finance-Specific Lexicons
❖ Loughran and McDonald (2011)
❖ When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. Journal of Finance.
11. Problem Formulation
❖ Prediction target: future stock return volatility (regression) and future relative risk levels (ranking)
❖ Features
❖ Soft textual information: All words or financial sentiment words
❖ Hard numerical information: Each company's volatility over the twelve months before the report
[Figure: timeline around the report filing date (2006/3/22). v(-12) covers 2005/3/22 to 2006/3/22; v(+12) covers 2006/3/22 to 2007/3/22.]
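The formulation on this slide can be sketched as follows: the twelve months before the filing date give the hard feature v(-12), and the twelve months after give the target v(+12). This is only an illustrative sketch; the helper names, the handling of February 29, and the sample numbers are assumptions, not details from the talk.

```python
from datetime import date


def shift_years(d, years):
    """Move a date by whole years, falling back to Feb 28 for leap days."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def volatility_windows(filing_date):
    """Date ranges for the pre-report feature and the post-report target."""
    return {
        "v(-12)": (shift_years(filing_date, -1), filing_date),
        "v(+12)": (filing_date, shift_years(filing_date, 1)),
    }


def build_example(text_features, past_volatility, future_volatility):
    """Combine soft (text) and hard (v(-12)) features; the target is v(+12)."""
    features = dict(text_features)
    features["v(-12)"] = past_volatility
    return features, future_volatility


windows = volatility_windows(date(2006, 3, 22))
# Made-up feature and volatility values, for illustration only.
example = build_example({"default": 1.0}, 0.2, 0.3)
print(windows, example)
```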
12. Corpora: The 10-K Corpus
❖ A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC).
❖ Only Section 7, “Management’s Discussion and Analysis of Financial Conditions and Results of Operations” (MD&A), is used.
❖ The Sarbanes-Oxley Act of 2002 explains the drastic increase in report length during the 2002-2003 period.
13. Experimental Results
14. Financial Sentiment Terms Analysis
[Figure: highly weighted stemmed terms. Legend: Fin-Neg, Fin-Pos, Fin-Lit, Fin-Unc, Non; other labels: SEN, ORG; ranks 1-5. Stems shown include amend, deficit, forbear, delist, default, sureti, discontinu, wherebi, unabl, disput, concern, profit, violat, regain, uncomplet, accid, abl, integr, grantor, ceg, nasdaq, gnb, coven, waiver, excelsior, rais, ebix, shelbour, nplacement, syndic, pfc, stage, same, driver, small-cap, seri, hearth, awg, libert, special, benefici, sever, breach, doubt.]
Stems and the word forms they cover:
❖ deficit: deficit, deficits
❖ default: default, defaulted, defaulting, defaults
❖ delist: delist, delisted, delisting, delists
❖ amend: amend, amendable, amendatory, amended, amending, amendment, amendments, amends
❖ forbear: forbear, forbearance, forbearances, forbearing, forbears
15. FIN10K Prototype Demo
https://cfda.csie.org/10K/
FIN10K: A Web-based Information System for Financial Report Analysis and Visualization. ACM CIKM (Demo paper), 2016.
16. Financial Keyword Expansion via Continuous Word Vector Representations
Discovering Finance Keywords via Continuous Space Language Models. ACM Transactions on Management Information Systems, 7(3), 7:1-7:17, 2016.
17. Sentiment Analysis: The Lexicon
❖ For sentiment analysis, the lexicon is one of the most important and common resources.
❖ It usually has a great impact on results and the corresponding analyses.
❖ In finance, the lexicon is usually semi-manually generated.
❖ This results in inadequate word coverage.
❖ In this work, we attempt to use advanced continuous space language models to expand finance keywords automatically.
18. Continuous Space Language Models
❖ “You shall know a word by the company it keeps” (J. R. Firth 1957)
❖ One of the most successful ideas of modern statistical NLP!
19. Continuous Space Language Models
❖ Continuous space language models
❖ a.k.a. continuous word embeddings
❖ Words are represented as low-dimensional dense vectors.
❖ Recent studies show their superiority in capturing syntactic and contextual regularities in language.
20. Keyword Expansion
❖ Our Proposed Keyword Expansion Method
❖ Adapt this technique to incorporate syntactic information, in order to capture more keywords with similar meanings.
❖ Learn vector representations of words from a large collection of financial reports (domain-specific).
❖ Words in the financial sentiment lexicon are used as seed words to obtain the top N words by cosine similarity.
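The seed-word step can be sketched in plain Python: for each seed word, rank all other words by the cosine similarity of their vectors and keep the top N. The three-dimensional vectors and the vocabulary below are made up for illustration; in the setting described on the slide, the vectors would be learned from a large collection of financial reports, and this sketch does not include the syntactic information the proposed method adds.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, vectors, top_n):
    """For each seed word, return the top_n most similar other words."""
    expanded = {}
    for seed in seeds:
        scored = [
            (cosine_similarity(vectors[seed], vec), word)
            for word, vec in vectors.items()
            if word != seed
        ]
        scored.sort(reverse=True)
        expanded[seed] = [word for _, word in scored[:top_n]]
    return expanded


# Toy vectors, for illustration only.
VECTORS = {
    "default": [0.9, 0.1, 0.0],
    "delinquency": [0.8, 0.2, 0.1],
    "insolvency": [0.7, 0.3, 0.0],
    "growth": [0.0, 0.1, 0.9],
}
result = expand_keywords(["default"], VECTORS, top_n=2)
print(result)
```

Ranking by similarity (highest first) is equivalent to ranking by cosine distance (lowest first), so either wording describes the same neighbors.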
21. Keyword Expansion
❖ Keyword Expansion with Syntactic Information
22. The New 10-K Corpus
23. Four Prediction Tasks
❖ Four prediction tasks are conducted to demonstrate that our approach is effective for discovering predictive keywords:
1) Post-event volatility
2) Stock volatility
3) Abnormal trading volume
4) Excess returns
24. Post-event Volatility Prediction
25. FIN10K Prototype Demo
https://cfda.csie.org/10K/
FIN10K: A Web-based Information System for Financial Report Analysis and Visualization. ACM CIKM (Demo paper), 2016.
26. Beyond Word-Level Analysis
❖ Multi-word expression detection and analysis
❖ Beyond Word-Level to Sentence-Level Sentiment Analysis for Financial Reports
❖ RiskFinder: A Sentence-level Risk Detector for Financial Reports, NAACL’18
❖ https://cfda.csie.org/RiskFinder/
❖ FRIDAYS: A Financial Risk Information Detecting and Analyzing System, AAAI’18
❖ https://cfda.csie.org/FRIDAYS/
27. Summary
❖ If structured data is big, then unstructured data is huge.
❖ 20% (structured) vs. 80% (unstructured)
❖ There is a massive potential waiting to be leveraged in the analysis of unstructured data in the field of finance.
28. Thank You for Listening!