DEVELOPING A BIG DATA ANALYTICS FRAMEWORK FOR INDUSTRY INTELLIGENCE
Gene Moo Lee
SAUDER SCHOOL OF BUSINESS
UNIVERSITY OF BRITISH COLUMBIA
Gene Moo Lee, CAIDA, Jan 2019
I do research on Business Analytics
It is about analyzing Big Data for decision making
One of the challenges is unstructured data (80-90%)
• Text, photo, audio, video
Approach:
• Machine learning, text mining, deep learning
• Econometrics, network analysis
Research on business analytics
• Social Media
• Cyber Security
• Industry Dynamics
• Mobile Ecosystem
Analyzing industry dynamics using a network approach
• An industry can be modeled as a network in which each node represents a firm and each edge represents a dyadic relationship between a pair of firms
• Inter-firm relationships: M&A, Competition, Opportunity, Alliance
Toward A Better Measure of Business Proximity: Topic Modeling for Industry Intelligence
MIS Quarterly 2016
Opportunity Structures: A Machine Learning Approach for Analyzing Industry Dynamics
Submitted to AOM 2019
Toward A Better Measure of
Business Proximity:
Topic Modeling for Industry Intelligence
Gene Moo Lee (UBC)
Joint work with
Zhan Shi (Arizona State)
Andrew B. Whinston (UT Austin)
MIS Quarterly 2016
Business proximity: Motivation
• To measure firms’ dyadic relatedness in product, market, and technology
• Essential in strategy / industrial organization fields
• Existing methods
• Common industry code, patent holdings, geographic distance
• Limitations: strong data requirements, not fine-grained, static
• Our approach: text mining on a corpus of business descriptions
• Can be automated to capture industry dynamics
Our approach to business proximity
• Approach: LDA topic modeling [Blei et al. 2003]
• Unsupervised learning to discover latent “topics” from a large
collection of documents
• Business proximity = cosine similarity of topic distributions
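The proximity measure above can be sketched in a few lines. This is a minimal illustration, assuming each firm's topic distribution is already available as a list of topic weights (for example, the per-document output of an LDA model). The function name and the three toy firms are hypothetical and not taken from the talk.

```python
import math

def business_proximity(topics_a, topics_b):
    """Cosine similarity between two firms' topic distributions."""
    if len(topics_a) != len(topics_b):
        raise ValueError("topic vectors must have the same length")
    dot = sum(a * b for a, b in zip(topics_a, topics_b))
    norm_a = math.sqrt(sum(a * a for a in topics_a))
    norm_b = math.sqrt(sum(b * b for b in topics_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no topic mass: treat as unrelated
    return dot / (norm_a * norm_b)

# Hypothetical topic distributions over 4 topics (each sums to 1).
firm_x = [0.70, 0.20, 0.05, 0.05]
firm_y = [0.60, 0.30, 0.05, 0.05]
firm_z = [0.05, 0.05, 0.20, 0.70]

print(business_proximity(firm_x, firm_y))  # similar topic mix: close to 1
print(business_proximity(firm_x, firm_z))  # different topic mix: close to 0
```

Because topic weights are non-negative, the measure lies between 0 and 1 and is continuous, unlike a shared industry code, which is either equal or not.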
[Diagram: Company descriptions -> LDA -> Industry-wide topics and company’s topics]
CrunchBase data on high-tech industry
• CrunchBase: open database (“Wikipedia”) of high-tech industry
• Data collection time: April 2013 ~ April 2015
• 24,382 U.S. high-tech companies (1.4% public, 5.7 years old)
• HQ location, CB-defined industry sector, key personnel, M&A,
investments, business summary
• Leading states: CA, NY, MA, TX
• Leading industries: software, web, e-commerce, ad, mobile
LDA topic model with CrunchBase
[Figure: example topics from the CrunchBase LDA model: Video/music, Energy, Sports, Healthcare]
Empirical analysis on M&A network
• Objective: examine the relationship between likelihood of M&A link
formation and nodal/dyadic characteristics
• Challenge: incorporate the interdependence of M&A deals
• Logit/probit cannot capture dependency between observations
• Model all M&A deals as a graph with an ERGM (p* model)
ERGM for M&A network
ERGM (Exponential Random Graph Model):
• Based on the random graph model [Erdős and Rényi 1959]
• Probability of realizing a graph = a function of the graph’s
statistics [Robins et al. 2007]
• Inter-firm proximity: business, geographic, social, co-invest
• Selective mixing: 50 states, 30 industry sectors
• Degree distribution: node degree, M&A experiences
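To make "probability of realizing a graph = a function of the graph's statistics" concrete, here is a toy sketch of the standard ERGM form, P(Y = y) = exp(theta . g(y)) / sum over all graphs y' of exp(theta . g(y')). It enumerates every graph on three firms, which is only feasible at toy scale. The proximity values and coefficients are hypothetical, and only two statistics (edge count and summed proximity) are included rather than the full model from the talk.

```python
import itertools
import math

# Hypothetical toy setup: 3 firms, so 3 possible undirected edges.
dyads = [(0, 1), (0, 2), (1, 2)]
# Hypothetical dyadic business proximity for each pair of firms.
proximity = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2}
# Hypothetical coefficients: negative edge term (sparse network),
# positive proximity term (close firms are more likely to be linked).
theta = {"edges": -2.0, "proximity": 3.0}

def graph_statistics(edges):
    """Statistics g(y): edge count and summed proximity over the edges."""
    return {
        "edges": len(edges),
        "proximity": sum(proximity[e] for e in edges),
    }

def unnormalized_weight(edges):
    """exp(theta . g(y)) for one graph y."""
    stats = graph_statistics(edges)
    return math.exp(sum(theta[name] * stats[name] for name in theta))

# Enumerate all 2^3 graphs on the three firms.
all_graphs = [
    tuple(e for e, keep in zip(dyads, mask) if keep)
    for mask in itertools.product([0, 1], repeat=len(dyads))
]
normalizer = sum(unnormalized_weight(g) for g in all_graphs)
probability = {g: unnormalized_weight(g) / normalizer for g in all_graphs}

for graph, p in sorted(probability.items(), key=lambda item: -item[1]):
    print(graph, round(p, 4))
```

With N firms there are 2^(N(N-1)/2) possible undirected graphs, so the normalizer cannot be enumerated for a real network; this is why estimation relies on MCMC.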
[Figure panels: degree, selective mixing, proximity]
Estimation setup
• Dataset
• US firms founded from 2008 to 2012: |V| = 24,382
• All dyadic/nodal attributes collected in April 2013
• M&A transactions (April 2013~April 2015): |E| = 394
• Estimate our ERGM M&A model
• Randomly sample 25% of companies for computational feasibility
• Run 100 Condor jobs on 100 sample graphs
• Estimate model coefficients by Markov chain Monte Carlo
(MCMC) maximum likelihood estimation (MLE)
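The sampling step can be sketched as follows. The slide only says that 25% of companies are sampled at random; keeping just the deals whose two firms are both sampled (an induced subgraph) is an assumption made here for illustration, and the ring-shaped toy network is hypothetical.

```python
import random

def sample_induced_subgraph(nodes, edges, fraction, rng):
    """Sample a fraction of nodes uniformly; keep edges with both endpoints sampled."""
    k = max(1, round(len(nodes) * fraction))
    sampled = set(rng.sample(sorted(nodes), k))
    kept = [(u, v) for u, v in edges if u in sampled and v in sampled]
    return sampled, kept

# Hypothetical toy network; the talk's network has |V| = 24,382 and |E| = 394.
nodes = list(range(40))
edges = [(i, (i + 1) % 40) for i in range(40)]

# One sampled graph per seed, mirroring the "100 sample graphs" step at toy scale.
samples = [
    sample_induced_subgraph(nodes, edges, 0.25, random.Random(seed))
    for seed in range(100)
]

sampled_nodes, sampled_edges = samples[0]
print(len(sampled_nodes), len(sampled_edges))
```

Each sampled graph can then be estimated independently, which is what makes the work easy to spread across separate batch jobs.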
ERGM estimation from a sample
Empirical results on proximities
• Proximities are normalized for comparison
• A 1.0 stdev increase in business proximity is equivalent in effect to
a 3.64 stdev increase in social proximity, or
a 6.89 stdev increase in investment proximity
[Table: coefficient estimates, with three coefficients marked positive (+)]
Empirical results on complementarity
• Original term (+) / squared term (-) -> inverted U-curve
• Interpretation: M&A transactions tend to occur between two firms that
are complementary but not substitutes
• Can find this curvilinear effect because our proximity has
(1) comprehensiveness and (2) continuity
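A short worked example of the inverted U: if proximity x contributes b1*x + b2*x^2 to the log-odds of a deal, with b1 > 0 and b2 < 0, the contribution peaks at x* = -b1 / (2*b2) and falls on either side. The coefficient values below are hypothetical, chosen only to show the shape; they are not the estimates from the talk.

```python
def proximity_effect(x, beta_linear, beta_squared):
    """Contribution of proximity x to the log-odds: b1*x + b2*x^2."""
    return beta_linear * x + beta_squared * x * x

def turning_point(beta_linear, beta_squared):
    """Proximity at which the effect peaks (requires a negative squared term)."""
    if beta_squared >= 0:
        raise ValueError("an inverted U requires a negative squared-term coefficient")
    return -beta_linear / (2.0 * beta_squared)

# Hypothetical coefficients: positive linear term, negative squared term.
beta_linear, beta_squared = 2.0, -1.5
peak = turning_point(beta_linear, beta_squared)

for x in (0.2, peak, 1.0):
    print(round(x, 3), round(proximity_effect(x, beta_linear, beta_squared), 3))
```

Firms below the peak are too unrelated to be attractive targets, while firms above it are close enough to be substitutes. A binary measure such as a shared industry code cannot show this, because it has no middle range.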
[Table: coefficient estimates, with three coefficients marked positive (+) and one marked negative (-)]
M&A matching platform w/ business proximity
• M&A matching platform for startups, VCs, and researchers
Platform UI: Find competitors
● M&A market is a two-sided platform
o buyers: established companies
o sellers: startups
● We can increase the efficiency of this two-sided market by
o building an interface, VentureMap, to make data accessible
o recommending matches with our M&A model
● Potential beneficiaries
o Established firms: intelligence/M&A department
o Startups: identify opportunities, potential buyers
o Venture capitalists
o Market intelligence firms
o Researchers in the finance field
Search firms by business components
Search firms by business components
LDA results from U.S. public firms (1995-2016)
Service sector (SIC 7) in 1995-2016
Healthcare sector in 1995-2016
Toward A Better Measure of Business Proximity:
Topic Modeling for Industry Intelligence
MIS Quarterly 2016
Big Data Analytics Special Issue
Opportunity Structures: A Machine Learning Approach for
Analyzing Industry Dynamics
Submitted to AOM 2019
Opportunity Structures:
A Machine Learning Approach for
Analyzing Industry Dynamics
Gene Moo Lee (UBC)
Joint work with
Myunghwan Lee (Yonsei), Hasan Cavusoglu (UBC),
Marc-David L. Seidel (UBC)
Structural hole as business opportunity
• From a network perspective, we view an industry as a network of firms
• Burt (1992) argues that as industry networks become
centralized, emerging “structural holes” serve as entry points
for new firms: new business opportunity
[Figure: fully-connected vs. centralized network of publishers (Publisher 1, Publisher 2) and subscribers (Subscriber A, Subscriber B); structural holes marked S.H.]
Definition of structural holes
• A pair of firms (I, J) has a structural hole w.r.t. firm K if
• (1) (I, J) are not connected
• (2) (I, K) and (J, K) are connected
• A pair of firms is connected if their 10-Ks are sufficiently similar to
each other (LDA or doc2vec model)
• A pair of firms can have 0 to N-2 holes (N=# firms in year t)
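The definition above translates directly into a count of common neighbors for unconnected pairs. The sketch below assumes pairwise 10-K similarities are already computed. The similarity values, firm labels, and the 0.6 threshold are hypothetical, since the slide only says "sufficiently similar".

```python
def build_connections(similarity, threshold):
    """Connect two firms when their document similarity reaches the threshold."""
    firms = sorted({firm for pair in similarity for firm in pair})
    connected = {firm: set() for firm in firms}
    for (a, b), score in similarity.items():
        if score >= threshold:
            connected[a].add(b)
            connected[b].add(a)
    return connected

def count_structural_holes(i, j, connected):
    """Number of firms K with I-K and J-K connected while I-J is not connected."""
    if j in connected[i]:
        return 0  # condition (1) fails: I and J are already connected
    return len(connected[i] & connected[j])

# Hypothetical pairwise similarities between four firms.
similarity = {
    ("A", "B"): 0.20, ("A", "C"): 0.80, ("B", "C"): 0.70,
    ("A", "D"): 0.90, ("B", "D"): 0.75, ("C", "D"): 0.10,
}
connected = build_connections(similarity, threshold=0.6)

print(count_structural_holes("A", "B", connected))  # C and D both bridge A and B
print(count_structural_holes("A", "C", connected))  # A and C are connected: no hole
```

With four firms, the pair (A, B) reaches the maximum of N-2 = 2 holes, matching the stated range of 0 to N-2.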
Data: Annual reports of U.S. public firms
• Form 10-K filings from SEC EDGAR: 165K reports (1995-2016)
• Use doc2vec [Le & Mikolov 2014] to get semantic vectors of 10-Ks
• Data: IPO/delisting events, financial and accounting metrics
Firm positioning with doc2vec: Link
Preliminary empirical results
• The likelihood of an IPO entry between two firms is positively related
to the number of structural holes between them. (Supported with a GLM)
• The number of structural holes for an IPO has a U-shaped relationship with the firm’s
ultimate mortality rate. (Partially supported with a Cox hazard model)
Dependent variable: Hazard Rate (Delisting)
Sector                        Mining & Construction   Wholesale & Retail   Finance      Services
# structural holes            n.s.                    negative*            negative**   negative*
# structural holes squared    positive+               n.s.                 positive*    n.s.
# firm-year obs.              1,215                   1,578                3,322        2,786
# delisting events            35                      67                   72           181
Concluding remarks
• Business Analytics is an emerging research area
• To apply AI, ML, and NLP to business data
• To gather insights for decision making
• Important business and societal decisions now depend on AI
• Many interesting research topics!
Contact Info: Gene Moo Lee
gene.lee@sauder.ubc.ca
Inter-firm relationships
• M&A: 1689 total
• cross-state: 62.6%
• cross-sector: 63.6%
• top 10 buyers: 14.3% (skewed)
• Investments: 531 total
• Job mobility: 19K total
Validation: Leading effect on business networks
• Avg. business proximity
• 0.293 (394 M&A pairs)
• 0.224 (129 investment pairs)
• 0.218 (9792 job mobility pairs)
• 0.068 (random pairs)
M&A matching platform w/ business proximity
• “Data-driven” platform for M&A matching and startup search
1. M&A executives to find M&A targets
2. Entrepreneurs to position their products
3. Venture capitalists to monitor niche markets
4. Analysts to examine industry trends
• Implemented a cloud-based information system based on the proposed business proximity
Cloud-based platform design
Big Data and Cloud technologies: Cronjob, NoSQL, Python, Scala,
Condor, Google Cloud (Storage, App Engine, Datastore) and more
What is in 10-K annual reports?
Part 1 (Items 1-4), Part 2 (Items 5-9), Part 3, Part 4
Item      Details
Item 1    “Business”: Description of the company’s business, its main products and services, subsidiaries it owns, markets it operates in, recent events, competition, etc.
Item 1A   “Risk Factors”: The most significant risks that apply to the company or its securities, listed in order of their importance.
Item 1B   “Unresolved Staff Comments”: Explanation of certain comments received from SEC staff on previous filings that have not been resolved over an extended period of time.
Item 2    “Properties”: Information about the company’s significant properties, such as principal plants, mines, etc.
Item 3    “Legal Proceedings”: Information about significant pending lawsuits or other legal proceedings, other than ordinary litigation.
Item 4    This item has no required information, but is reserved by the SEC for future rulemaking.
Information system design
SEC data from EDGAR
Electronic Data Gathering, Analysis, and Retrieval system
• U.S. Securities and Exchange Commission (SEC)
• 513K firms/individuals (CIK) in 451 industry sectors (SIC)
• 23K cities, 43K ZIP codes, 251 states (incl. non-US)
• 11.2M forms filed of 723 types (4, 8-K, 10-K/10-Q, etc.)
• Text of 10-K documents was parsed (Parts and Items)
• Duration: 1995 ~ ongoing
Number of 10-K filers
1995-2016
Headquarters of 10-K filers
1995-2016
Analysis I: Trend analysis with LDA
• Approach: LDA topic modeling [Blei et al. 2003]
• Unsupervised learning to discover latent “topics” from a large
collection of documents
• Each document is represented as a distribution over the topics
[Diagram: Business descriptions (10-K) -> LDA -> Industry-wide topics and company’s topics]
Microsoft Topics, 1995 - 2016
Dominating topics: software/tech/data, president/vice/executive, stores/retail, systems/manufacturing/tech
Interactive visualization demo: Link
• Developed an interactive visualization tool to
identify industry topic trends by:
• Selected time windows (time-based)
• Industry sectors (SIC code-based)
• Geographic location (city, state-based)
• A particular firm (CIK-based)
Analysis II: Competitive analysis with word2vec
• Approach: word embedding with word2vec [Mikolov et al. 2013]
• Represent words in a high-dimensional vector space where
semantically similar words are nearby
• Train a model to predict word co-occurrence
(K words before/after the focal word)
• Competition level = distance between word vectors
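The "K words before/after the focal word" idea can be illustrated by generating the (focal word, context word) pairs that a skip-gram style word2vec model learns to predict. The slide does not say which word2vec variant was used, so skip-gram is an assumption here, and the example sentence and function name are hypothetical.

```python
def context_pairs(tokens, window):
    """Return (focal, context) pairs for words within `window` positions of each other."""
    pairs = []
    for idx, focal in enumerate(tokens):
        start = max(0, idx - window)
        stop = min(len(tokens), idx + window + 1)
        for ctx_idx in range(start, stop):
            if ctx_idx != idx:
                pairs.append((focal, tokens[ctx_idx]))
    return pairs

# Hypothetical tokenized sentence from a business description.
tokens = ["we", "compete", "with", "google", "and", "yahoo"]
pairs = context_pairs(tokens, window=2)

for focal, context in pairs:
    if focal == "google":
        print(focal, "->", context)
```

Company names that keep appearing in the same contexts across filings (for example, in lists of competitors) end up with nearby vectors, which is what the nearest-neighbor results rely on.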
[Diagram: Business descriptions (10-K) -> Word2Vec]
Static word2vec: Google
Nearest neighbors: Yahoo, Linkedin, Amazon, Netflix, Facebook, Youtube, Hulu, Bing
Static word2vec: Android
Nearest neighbors: apple_ios, google_android, symbian, ipad_iphone
Temporal word2vec: Apple
Nearest neighbors:
• 1995: jbl, unisys, novell
• 2000: information_technology, isuzu, powertel_inc
• 2005: dell, panasonic, emi, midi
• 2010: amazon, sony, dell
• 2015: google, iphone, ipad, sony
Temporal word2vec: IBM
Nearest neighbors:
• 1995: microsoft, nordstorm, apple
• 2000: oracle, sun, sun_microsystems
• 2005: motorola, dell, cisco, sun_microsystems
• 2010: hewlett_packard, motorola, oracle, dell
• 2015: dell, oracle, nokia, microsoft
Gene Moo Lee, WITS, Seoul, Korea, December 2017
Google Topics 1995 - 2016
Dominating topics: software/tech/data, advertising/television/media
Static word2vec results: iPhone
Nearest neighbors: mobile_platforms, ipad, android_phones, tablet_devices, ipod_touch, handheld_devices


Editor's Notes

  • #8 We propose a novel business proximity based on Latent Dirichlet Allocation. Companies’ business descriptions are the input to LDA. Then LDA produces (a) industry-wide topics and (b) topic distribution for each company. Then inter-firm proximity is calculated with cosine similarity between two topic distribution vectors.
  • #9 Introducing the data, we collected data from CrunchBase, which is the “Wikipedia” for high-tech industry. It has information on 24K U.S. companies, including locations, key people, transactions, and business summary. California, New York are the hubs. Software, web, e-commerce are the leading industry sectors.
  • #10 Giving you a sense on the constructed topic model, here is the topic model built with companies’ business descriptions. Let’s take a look at some topics.
  • #11 Now here is our research question: What are the underlying driving forces that realized the M&A network? There are three challenges we try to approach.
  • #12 Based on the constructed proximity, the next thing to do is to model the M&A network. To incorporate the interdependency of different M&A transactions, we employ an exponential random graph model (ERGM). The idea is that, given a set of nodes, the probability of realizing a specific graph is a function of various graph statistics. Proximity: p = sum of business/social/invest/geo proximities in all deals. Degree distribution: network density, companies with multiple deals (power law). Selective mixing: 50 states and 30 categories.
  • #13 MZS: one paper said that you cannot sample…. scaling up ERGM is a problem to solve without decomposing the problem run R in distributed system
  • #14 one sample out of 100 samples
  • #15 Here is the model estimation result from the M&A network. For computational feasibility, we ran the estimation on a 25% sample 100 times. Here we report the % of samples with the expected signs of the proximity-related parameters. Business: 86%, social: 70%, invest: 51%, geo: 5%. Interpretation of theta: everything else held equal, if forming a new edge increases the proximity sum by 1, then the log-odds (logit) of that edge forming increase by theta.
  • #16 This is another result to consider the idea of complementarity. In other words, big companies may want to buy targets that are related but not too close to them. To test this idea, we added squared term of business proximity in this specification. Checking the signs of the original and squared terms….
  • #18 In this platform, you can search companies and their competitors based on the business proximity we propose.
  • #19 You may also search companies by selecting topics.
  • #20 Also, you can specify the topics of your interest and find relevant companies that fit your searching criteria. We believe that this is like Google search for high-tech industry. This may increase the M&A market efficiency.
  • #33 Then we are interested in the M&A network. Specifically, we look at M&A in the last two years. We started data collection in April 2013, and we try to use the initially collected data to predict future activities. In total, there are 1,689 M&As, and 394 are the recent ones.
  • #34 Here we validate the new business proximity measure by relating it to firm interactions. Specifically, we look at four groups of company pairs: M&A, invest, job movements, and random.
  • #36 Now that we know that our proposed proximity is an important factor in M&A network formation, we prototyped a platform based on the business proximity idea. The back end collects data about the high-tech industry and builds topic models to analyze the industry. The front end provides a cloud-based interface in which people can navigate the industry.