What Goes Wrong with Language Definitions and How to Improve the Situation
Company Search Project
1. When searching for investment opportunities, finding a company
to research is the first critical step. Current online tools often
neglect this very important aspect of investment research.
CompanySearch helps users find publicly traded companies
based on financial filters and user-defined keywords.
2. Information Need
● With thousands of companies to choose from, it can be difficult for an investor to
find publicly traded companies which fit their individual investment needs.
● Casual investors need a way to transform broad ideas (e.g. “I am interested in the
machine learning revolution”) into tangible investment prospects (e.g. “I should
invest in Baidu”)
● CompanySearch is built to address this information need. In the above
example, an investor might enter the category “information technology” along
with the keywords “machine”, “learning”, “artificial”, and “intelligence”.
CompanySearch will return relevant companies like Baidu!
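The category-plus-keyword flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Company` class, the `search` function, and the `SAMPLE` companies are hypothetical and made up for the example, and the real system also uses LDA topic similarity, which is not shown here.

```python
import re
from dataclasses import dataclass


@dataclass
class Company:
    # Hypothetical record layout; the project's real schema is not described.
    name: str
    categories: set
    description: str


def search(companies, categories, keywords):
    """Keep companies that belong to every requested category, then rank
    them by how many distinct keywords appear in their description."""
    wanted = {c.lower() for c in categories}
    terms = {k.lower() for k in keywords}
    results = []
    for company in companies:
        # Categorical filter: all requested categories must be present.
        if not wanted <= {c.lower() for c in company.categories}:
            continue
        words = set(re.findall(r"[a-z]+", company.description.lower()))
        hits = len(terms & words)
        if hits:
            results.append((company.name, hits))
    # Highest keyword overlap first.
    results.sort(key=lambda r: r[1], reverse=True)
    return results


# Made-up sample data, for illustration only.
SAMPLE = [
    Company("AirCo", {"Large Cap", "Value"},
            "Operates airlines offering overseas travel."),
    Company("FuelCo", {"Large Cap", "Value"},
            "Sells fuel to airlines and shipping firms."),
    Company("AppCo", {"Information Technology"}, "Builds travel apps."),
]

if __name__ == "__main__":
    print(search(SAMPLE, ["Large Cap", "Value"],
                 ["airlines", "travel", "overseas"]))
```

Note that this sketch only counts matches, so a keyword that never appears in a description costs a company nothing.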
3. Input/Output Good Example
● Categories: Large Cap, Value
● Keywords: Airlines, Travel, Overseas
● Top Result: United Airlines
● Analysis: This result matched our query because it was a large cap, value stock and
matched several keywords. Additionally, the company description was found to be
topically similar to the keywords by LDA.
4. Input/Output Bad Example
● Categories: Information Technology
● Keywords: Social, Mobile, Video, Snapchat
● Top Result: Tegna Inc.
● Analysis: Tegna was selected because, according to its own company description, it plays a
heavy role in social and mobile platforms. Even though it has nothing to do with
Snapchat, our system does not penalize unmatched keywords. A more accurate result might
have been SNAP, which more closely matches the intent of the query but whose company
description is very vague and does not match many of the keywords. In general,
our system does not perform well on very specific queries with a specific company
in mind. Then again, that was never our intended use case.
6. Changes Since First Prototype
● Many UI improvements based on user feedback emphasizing readability
● Integration of machine learning, specifically LDA and k-means clustering
● Twitter data is now cached, addressing limitations of the Twitter API mentioned
in peer review
● Users are now given feedback as to why results are shown, addressing the need for
transparency brought up in peer review
● Improved ranking metric, weighting keyword similarity above categorical
similarity
● Removed query expansion (it was too buggy and gave odd answers, e.g. the
“machine” in “machine learning” expanded to “car” and “auto”)
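One way to picture a ranking metric that weights keyword similarity above categorical similarity is a simple weighted sum. The sketch below is an assumption, not the project's actual formula: the function names, the 0.7 weight, and the similarity values are all hypothetical.

```python
def combined_score(keyword_sim, category_sim, keyword_weight=0.7):
    """Blend two similarities in [0, 1]; keyword_weight > 0.5 ensures
    keyword similarity counts for more than categorical similarity.
    The default of 0.7 is an illustrative guess, not a tuned value."""
    if not 0.5 < keyword_weight <= 1.0:
        raise ValueError("keyword_weight must be in (0.5, 1.0]")
    return keyword_weight * keyword_sim + (1.0 - keyword_weight) * category_sim


def rank(candidates, keyword_weight=0.7):
    """candidates: list of (name, keyword_sim, category_sim) tuples.
    Returns company names ordered from best to worst combined score."""
    scored = sorted(
        candidates,
        key=lambda c: combined_score(c[1], c[2], keyword_weight),
        reverse=True,
    )
    return [name for name, _, _ in scored]


if __name__ == "__main__":
    # Made-up similarities: A matches keywords well, B matches categories well.
    print(rank([("B", 0.3, 1.0), ("A", 0.8, 0.2)]))
```

With these made-up inputs, the company with strong keyword overlap outranks the one that only matches the categories.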
7. Qualitative Evaluation
● Categories: Large Cap, Value
● Keywords: Airlines, Travel, Overseas
● Prototype 2 Output: World Fuel Services
● Final Project First Output: United Airlines
● Analysis: Our final project gives a much better answer than our second prototype.
Using topic modeling, our system can better home in on airlines and travel
agencies. Additionally, removing thesaurus-based query expansion resulted in
much more predictable results!
9. Trial and Error / Challenges
● There was a significant tradeoff between the number of topics used in LDA and the
runtime of our application. In general, balancing runtime against
result quality was the largest challenge we dealt with
● We implemented pseudo-Rocchio query expansion, thesaurus-based query
expansion, and reverse stemming, but found that all three led to worse query results :(
● We tried to make the graph in our results interactive, but it proved to be very
time-consuming and resource-intensive
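For readers unfamiliar with Rocchio-style expansion, here is a minimal sketch of the pseudo-relevance-feedback variant. The top-ranked documents are assumed to be relevant, and their most heavily weighted terms are appended to the query. The function names, parameter values, and `DOCS` are hypothetical and made up for illustration; this is not the project's implementation.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def rocchio_expand(query_terms, ranked_docs, top_k=2,
                   alpha=1.0, beta=0.75, n_new_terms=2):
    """Expand a query with terms from the top_k ranked documents.
    New weight = alpha * query weight + beta * mean term count in the
    top_k documents (no negative-feedback term in the pseudo variant)."""
    query_vec = Counter(t.lower() for t in query_terms)
    top = ranked_docs[:top_k]
    centroid = Counter()
    for doc in top:
        centroid.update(tokenize(doc))
    weights = {}
    for term in set(query_vec) | set(centroid):
        feedback = beta * centroid[term] / len(top) if top else 0.0
        weights[term] = alpha * query_vec[term] + feedback
    # Pick the heaviest terms not already in the query (ties broken A-Z).
    candidates = [t for t in weights if t not in query_vec]
    candidates.sort(key=lambda t: (-weights[t], t))
    return list(query_terms) + candidates[:n_new_terms]


# Made-up top-ranked descriptions, for illustration only.
DOCS = [
    "machine tools for car factories",
    "car parts and machine shops",
]

if __name__ == "__main__":
    print(rocchio_expand(["machine", "learning"], DOCS, n_new_terms=1))
```

With these made-up documents, the expansion adds "car" to a query about machine learning. That shows how feedback from off-topic top results can pull a query away from its intent.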
10. Improvements
● Better integration with currently available investment tools. Specifically,
integrating Bloomberg services into our application could make it a “one stop
shop” from start to finish for investors
● Use of the enterprise Twitter API would allow for more current tweets and thus more
relevant results for users
Known Issues
● Not every company in our data set has a description (although the majority do)
● Graphical results are pulled from CNN, which on occasion has issues preventing
the images from downloading