Using FME we build an API to collect and clean quarterly filings from the US Securities and Exchange Commission's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) website. With FME quickly pooling the filing data, we perform sentiment analysis on the cleaned, unstructured Management Discussion & Analysis (MD&A) text. We implement word-to-vector strategies to tokenize the fairly boilerplate text and assign the companies into groupings of changer and non-changer companies. This is done mainly by graphing deltas in cosine similarity between the tokenized word vectors, and also by using word-count vector strategies to flag language unattractive to investment. The end goal of this analysis is to forecast abnormal returns and find diversification opportunities that align with our existing clients.
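The core measurement described above — tokenize two filings and compare their bag-of-words vectors by cosine similarity — can be sketched as follows. This is a minimal stand-in, not the deck's actual pipeline: the tokenizer and the sample sentences are illustrative assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on letter runs; a minimal stand-in for a real tokenizer.
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two consecutive (toy) MD&A snippets; the delta in similarity across
# quarters is what gets graphed for change detection.
prior = Counter(tokenize("Revenue grew steadily and margins improved."))
current = Counter(tokenize("Revenue grew steadily and margins improved again."))
print(round(cosine_similarity(prior, current), 3))
```

Boilerplate-heavy quarters score near 1.0; a sharp drop between consecutive filings is the "change" signal.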
4. Frame the question
What do we want to know: detect change
Is there data to support: yes
Has it been done before: yes (blog)
Can FME improve the process: absolutely
10. Dataflow
Fetch & Prep
● Collect text/HTML-based data
FME to Automate
● Extract numeric values for SAP; extract and tidy text for sentiment
Pass the result
● Summary info and binning to lessen the information load
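The Fetch & Prep stage starts from EDGAR's quarterly catalog files. A sketch of parsing one catalog row is below; it assumes the pipe-delimited master.idx layout (CIK|Company Name|Form Type|Date Filed|Filename), and the sample row's values are illustrative, not a verified filing.

```python
def parse_index_line(line):
    # Parse one data row of an EDGAR quarterly master.idx catalog.
    # Assumed layout (pipe-delimited):
    #   CIK|Company Name|Form Type|Date Filed|Filename
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 5:
        return None  # header, separator, or malformed row
    cik, company, form, date, filename = parts
    return {"cik": cik, "company": company, "form": form,
            "date": date, "filename": filename}

row = parse_index_line(
    "320193|APPLE INC|10-Q|2019-01-30|edgar/data/320193/0000320193-19-000010.txt"
)
print(row["cik"], row["form"])
```

Rows that parse to `None` (headers, dashed separators) are simply skipped; the surviving records drive the per-CIK file retrieval.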
23. NLP tell me more
Semantic search
● Key phrases, entities & sentiment*
Identify topics
● Organize by market sector, similarity
24. NLP tell me more
Word vectoring
● Unique word counts, similarity indices: representing bodies of text as numerical arrays
Measure similarity
● For a given company, build vectors per filing
● Detect change intra-company but inter-document
25. NLP small swings
While the text is mostly neutral
● Some positive/negative characteristics are present
26. NLP what can it offer
Word vectoring
● Mathematical representation of relatedness
Identify topics
● Flagged terms
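The flagged-terms idea — a word-count pass that surfaces language unattractive to investment — can be sketched with a small lexicon. The flag list below is an assumption for illustration; a production run would use a finance-specific lexicon, not this hand-picked set.

```python
import re
from collections import Counter

# Illustrative flag list (an assumption, not the deck's actual lexicon).
FLAGGED = {"litigation", "impairment", "restatement", "default", "writedown"}

def flag_score(text):
    # Count occurrences of flagged terms in the tokenized text.
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    hits = {w: tokens[w] for w in FLAGGED if tokens[w]}
    return sum(hits.values()), hits

score, hits = flag_score("The pending litigation and goodwill impairment ...")
print(score, hits)
```

The score feeds the same graphing step as sentiment: a quarter-over-quarter jump in flagged-term counts is another reason to investigate.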
27. NLP swings in neutral
Graph sentiment changes
● Flag to investigate
Graph other metrics
● Graph EPS
How much are they spending on X (what are their reserves)?
Their overall financial indicators (EPS, share price, revenues, etc.)
What are they spending money on (investments)?
Sustainability and social responsibility
Compare companies across a sector
Quite a lot of data, in fact
FME is great at building an API: retrieving the text files, parsing them for numeric financial data, and cleaning them to create a unified corpus.
https://aws.amazon.com/architecture/icons/
https://www.sec.gov/index.htm
https://upload.wikimedia.org/wikipedia/commons/5/59/SAP_2011_logo.svg
https://upload.wikimedia.org/wikipedia/commons/0/0a/Python.svg
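The retrieval step boils down to constructing the quarterly catalog URL and fetching it (e.g. with an HTTPCaller). A sketch of the URL construction, assuming EDGAR's standard full-index layout:

```python
def master_index_url(year, quarter):
    # Build the URL for an EDGAR quarterly catalog (master.idx).
    # Assumes the standard full-index path layout; quarter is 1-4.
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    return (f"https://www.sec.gov/Archives/edgar/full-index/"
            f"{year}/QTR{quarter}/master.idx")

print(master_index_url(2019, 1))
```

Looping year × quarter over this template yields every catalog needed for a historical pull.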
Use FME to grab the catalog files, use those catalogs to retrieve the filing data files you are actually after, then clean the resultant files for hierarchy and maintain CIK (Central Index Key) identifier information.
Use FME to clean the resultant files to rebuild hierarchy and maintain CIK identifier information.
Use FME to strip out the HTML encoding and prep the dense text data.
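Stripping the HTML down to dense text can be done with the standard library alone. A sketch using `html.parser` (the sample markup is illustrative, not an actual filing fragment):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect only the text nodes, dropping tags plus script/style bodies.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self.parts).split())

p = TextExtractor()
p.feed("<html><body><p>Management&rsquo;s Discussion</p>"
       "<script>x()</script></body></html>")
print(p.text())
```

Entities like `&rsquo;` are decoded automatically (`convert_charrefs` defaults to True), so the output is clean text ready for tokenization.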
Use FME to assemble the CIK-based file organization to start the clean catalog.
By the time we present at WT, the batteries-included version of NLP will be in the latest official release.
Is it still important to mess around with external libraries and cloud services?
What can you accomplish with external libraries and service calls?
Working with Python libraries in FME is extremely useful. It can prove tricky to maintain symlinks and version parity, but it is worth understanding how to drive external services from within FME.
FME 2019.0 ships with Python 3.6 & 3.7.
TensorFlow / Keras, as of this writing, does not work above Python 3.6.x; it may require a 3.6.x install on your FME machine.
When symlinks fail, manually copy the libraries to achieve library parity.
KB article : https://knowledge.safe.com/questions/49484/how-do-i-set-the-import-path-in-fmes-python-interp.html
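The usual pattern from that KB article is to extend the interpreter's import path at the top of a PythonCaller so FME's bundled Python can find externally installed libraries. A sketch, where the directory below is an assumed example path, not a real install location:

```python
# Extend the import path so a PythonCaller can find libraries installed
# outside FME's bundled Python (per the Safe KB article on import paths).
import sys

# Assumed example location of an external site-packages directory.
EXTRA_SITE_PACKAGES = r"C:\Python36\Lib\site-packages"

if EXTRA_SITE_PACKAGES not in sys.path:
    sys.path.append(EXTRA_SITE_PACKAGES)

print(EXTRA_SITE_PACKAGES in sys.path)
```

Appending (rather than prepending) keeps FME's own bundled packages winning any name collisions.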
The JSON dictionary object returned by AWS needs to be caught and decomposed before passing it to an FME object. (We stumbled across that one.)
The AWS API has a (very) limiting byte length it will accept. Loop accordingly.
Use a JSONExtractor to expose the returned attributes.
Limitations: 5,000 bytes per request, pay-as-you-go usage.
No training required; reasonably priced.
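"Loop accordingly" means chunking each MD&A section under the byte limit before the service calls. A sketch that splits on whitespace (assuming no single word exceeds the limit; the 5,000-byte cap matches the limitation noted above):

```python
def chunk_text(text, max_bytes=5000, encoding="utf-8"):
    # Split text on whitespace so no chunk exceeds the service's byte limit.
    # Assumes no single word is itself larger than max_bytes.
    chunks, current, size = [], [], 0
    for word in text.split():
        extra = len(word.encode(encoding)) + (1 if current else 0)
        if size + extra > max_bytes and current:
            chunks.append(" ".join(current))
            current, size = [], 0
            extra = len(word.encode(encoding))
        current.append(word)
        size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

pieces = chunk_text("lorem " * 2000, max_bytes=5000)
print(len(pieces), max(len(p.encode("utf-8")) for p in pieces))
```

Each chunk is sent as its own request, and the per-chunk sentiment results are aggregated back per filing.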
Sentiment, or the lack of neutrality: degrees around the neutral sentiment.