Towards Advanced Business Analytics
with Text Mining and Deep Learning
Gene Moo Lee, Ph.D.
Assistant Professor of Information Systems
University of British Columbia
July 30, 2017
I do research on Business Analytics
It is about analyzing Big Data for business decisions
The challenge lies in unstructured Big Data (80-90% of all data)
Solution: machine learning, text mining, deep learning!
2
Big Data Analytics
3
Business | Mobile | Social | Internet Security
Domain 1: Mobile Analytics
4
Business | Mobile | Social | Internet Security
Matching Mobile Applications
for Cross Promotion
Gene Moo Lee (UT Arlington)
Joint Work with
Shu He (UConn)
Joowon Lee (Hansung U)
Andrew B. Whinston (UT Austin)
Working Paper
Success of mobile app markets
[Bar chart: number of apps (in thousands) on Google Play, Apple App Store, Windows Phone, and BlackBerry]
6
App ad channels: Cross Promotion
Three channels: mobile display ads, social network ads, and cross promotions (incentivized)
In this work, we study cross promotions
• Two-sided matching: popular apps + new apps
• Incentivize app installs with rewards
7
IGAWorks data
• IGAWorks: Mobile ad platform company in Korea
• 1898 matches in cross promotions (Sept 2013 ~ May 2014)
• 325 mobile apps (195K apps with meta info)
• 1.1 million users
8
User engagement
• “Free rider” issue
• Good news: good matchings do much better!
• Q: what makes a good match?
• A: App similarity
9
Topic models of app markets
• 195K apps’ descriptions → 100 topics (see the LDA sketch below)
• Challenge: natural language processing for Korean texts
10
Music: 피아노, sskin, 클래식, flipfont, 한국어, shw, 아이콘, 갤럭시, butterfly, 사운드, 교향곡, 개발자, 베토벤, 아름다운, 모차르트, piano, 공주님, 입니다, 배경화면, 이야기
Christian: 오디오, 새찬송, 저작권, 즐겨찾기, 교독문, 북마크, 통일찬송, 주기도문, 사도신경, 개역개정, 개역개정판, 성경사전, 십계명, 하이라이트, niv, 콘텐츠, bull, 아가페, 645곡, 찬송가
English: 영단어, 단어장, 테스트, 원어민, 영어단어, toeic, 뇌새김, 우선순위, 단어들, 교과서, 학습법, 시리즈, 이미지, voca, 재미있게, 아니라, tts, 학습자, 구성되어, 어휘력
Kids: 뽀로로, 친구들, 애니메이션, 있어요, 노래해요, 뽀롱뽀롱, 아이들, 대모험, 싶을때, 콘텐츠, 리스트, wwe, 오프닝, 홈쇼핑, british, 이용하실, council, 변신자동차
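The topics above come from the authors' topic model of app descriptions. As a rough, hypothetical illustration of how such topics could be produced, here is a minimal gensim sketch on a few made-up description snippets. The toy texts, the 3-topic setting, and the naive whitespace tokenization are assumptions; the actual study fit 100 topics on 195K Korean descriptions, which would also require a proper Korean morphological analyzer (e.g., KoNLPy).

```python
# Minimal LDA sketch with gensim (toy data, not the authors' pipeline).
from gensim import corpora, models

# Hypothetical app-description snippets; the real corpus had 195K Korean docs.
descriptions = [
    "피아노 클래식 사운드 베토벤 모차르트",
    "찬송가 주기도문 성경사전 십계명",
    "영단어 단어장 toeic 원어민 학습자",
]
texts = [d.split() for d in descriptions]  # naive tokenization (assumption)

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# 3 topics on 3 toy documents; the study used 100 topics.
lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=3, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```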
Matching market design
• Design an app matching market
• Extend the model to many-to-many matching
• Use a generalized deferred acceptance algorithm (sketch below)
• Incorporate individual user profiling
• Conduct randomized field experiments
11
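The generalized deferred acceptance algorithm mentioned above refers to the paper's many-to-many extension, which is not reproduced here. As a hedged illustration of the underlying idea, below is a minimal one-to-one deferred acceptance (Gale-Shapley) sketch; the function name and the toy preference lists are hypothetical.

```python
# One-to-one deferred acceptance (Gale-Shapley) sketch; the paper's
# generalized many-to-many version is not shown.
def deferred_acceptance(proposer_prefs, reviewer_prefs):
    """Each dict maps an agent to an ordered list of acceptable partners
    (most preferred first). Returns a dict: reviewer -> matched proposer."""
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in reviewer_prefs.items()}
    free = list(proposer_prefs)                 # proposers still needing a match
    next_choice = {p: 0 for p in proposer_prefs}
    match = {}                                  # reviewer -> proposer
    while free:
        p = free.pop()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue                            # p has exhausted its list
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = match.get(r)
        if current is None:
            match[r] = p                        # r tentatively accepts p
        elif rank[r].get(p, float("inf")) < rank[r].get(current, float("inf")):
            match[r] = p                        # r trades up; current proposes again
            free.append(current)
        else:
            free.append(p)                      # r rejects p; p tries its next choice
    return match

# Toy example: new apps propose to popular host apps (hypothetical names).
new_apps = {"new1": ["host1", "host2"], "new2": ["host1"]}
hosts = {"host1": ["new2", "new1"], "host2": ["new1"]}
print(deferred_acceptance(new_apps, hosts))     # {'host1': 'new2', 'host2': 'new1'}
```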
Performance evaluation: 100% improvement
12
Domain 2: Social Media Analytics
13
Business | Mobile | Social | Internet Security
Content Complexity, Similarity,
Consistency in Social Media:
A Deep Learning Approach
Gene Moo Lee (UT Arlington)
Joint Work with
D. Shin (Amazon), S. He (UConn),
A. B. Whinston (UT Austin)
S. Cetintas, K.-C. Lee (Yahoo! Research)
Working Paper
Social media: More users
15
Social media: More spending
16
Challenges and opportunities: 78% of posts are photos
17
Source: Chang et al. 2014
Company blogs in Tumblr
18
BMW USA, Vogue, IBM
Data: blog post and engagement
19
Post = Visual Info (Image) + Textual Info (Text, Tags)
Customer engagement = Notes (Likes + Reblogs)
Research questions
• What kinds of posts get more likes or shares?
• What roles do visual and textual content play?
• How can we construct measures from these unstructured data sources?
20
Tumblr data
• Tumblr: blogging platform (acquired by Yahoo!)
• 33,686 posts by 178 companies (May - Oct 2014)
• Automobile, Entertainment, Food, Fashion, Finance, Leisure, Retail, Tech
• 89.7% photo & text, 6.3% pure text, 4% videos
• Collected “likes” and “reblogs” until Nov 2014
21
Visual features
• Aesthetics (beautiful photos)
• Adult content
• Celebrity
• Feature complexity (low-level, flashy images)
• Semantic complexity (high-level, complex meaning)
• Number of salient objects
22
Deep learning
• A branch of machine learning, inspired by the human brain
• Algorithms that model high-level abstractions with multiple processing layers of non-linear transformations
• Enabled by (1) theoretical breakthroughs, (2) Big Data, and (3) powerful computation
• Successfully applied to image/video/voice recognition, AlphaGo, etc.
23
ImageNet: image DB with tree-structured tags
24
Source: ImageNet
More visual features
• 7th-layer output = a robust representation of the image for computer vision tasks (see the sketch below)
• Aesthetic/beauty score [Dhar et al. 2011 (CVPR, Vision)]
• Adult-content score [Sengamedu et al. 2011 (MM, Vision)]
• Celebrity (450 celebrities) [Parkhi et al. 2015 (BMVC, Vision)]
• Number of salient objects [Zhang et al. 2015 (CVPR, Vision)]
25
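To make the "7th-layer output" concrete, here is a minimal sketch of extracting a 4,096-dimensional fc7-style representation from a pretrained CNN with torchvision. The choice of VGG16, the torchvision weights string, the preprocessing constants, and the file name post_image.jpg are assumptions for illustration; the slide does not say which network the authors actually used.

```python
# Sketch: fc7-style image features from a pretrained VGG16 (assumed setup;
# requires a reasonably recent torchvision).
import torch
from torchvision import models, transforms
from PIL import Image

vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
# Convolutional features + classifier layers up to the fc7 ReLU (4,096-d output).
fc7_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(), *vgg.classifier[:5]
)

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("post_image.jpg").convert("RGB")   # hypothetical blog-post image
with torch.no_grad():
    features = fc7_extractor(preprocess(img).unsqueeze(0))
print(features.shape)                               # torch.Size([1, 4096])
```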
Examples: Visual features
• Visual complexity theory (Attneave 1994, Donderi 2006, Pieters et al. 2010)
• Visual stimuli are a composite of colors, luminance, shapes, and the number of objects/patterns
26
Textual features
• Topics
• Word clusters
27
Image-text similarity
• Topics
• Word clusters
28
Empirical results
• [+] GIFs, beautiful/adult images, celebrities, more images, consistent image and text
• [-] Videos, semantic complexity, more sentences, questions, asking for reblogs
• Industry-specific results: e.g., adult images only work for Fashion
• Different short- and long-term effects
• Differences between hedonic and utilitarian products
29
Domain 3: Business Analytics
30
Business | Mobile | Social | Internet Security
Towards A Better Measure of
Business Proximity:
Topic Modeling for Industry Intelligence
Gene Moo Lee (UT Arlington)
Joint work with
Zhan Shi (Arizona State)
Andrew B. Whinston (UT Austin)
MIS Quarterly
M&A in high-tech industry
32
• Understand M&A matching in the high-tech industry
• Propose a business proximity measure based on machine learning
• Along with geographic, social, and investment proximity
Our approach on business proximity
• Approach: LDA topic modeling [Blei et al. 2003]
• Unsupervised learning to discover latent “topics” from a large collection of documents
• Business proximity = cosine similarity of topic distributions (sketch below)
33
[Diagram: company descriptions → LDA → industry-wide topics + each company’s topic distribution]
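A minimal sketch of the proximity computation described above: cosine similarity between two firms' topic distributions. The 5-dimensional toy vectors and the helper name business_proximity are hypothetical; the actual vectors come from the LDA model fitted on company descriptions.

```python
# Business proximity as cosine similarity of topic vectors (toy example).
import numpy as np

def business_proximity(t_i, t_j):
    """Cosine similarity of two topic distributions: 0 = no overlap, 1 = identical."""
    t_i, t_j = np.asarray(t_i, dtype=float), np.asarray(t_j, dtype=float)
    return float(t_i @ t_j / (np.linalg.norm(t_i) * np.linalg.norm(t_j)))

firm_a = [0.70, 0.20, 0.05, 0.05, 0.00]   # hypothetical 5-topic distributions
firm_b = [0.60, 0.10, 0.10, 0.10, 0.10]
firm_c = [0.00, 0.05, 0.05, 0.20, 0.70]
print(business_proximity(firm_a, firm_b))  # high: similar businesses
print(business_proximity(firm_a, firm_c))  # low: distant businesses
```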
CrunchBase data
• CrunchBase: an open database (a “Wikipedia”) of the high-tech industry
• Data collection period: April 2013 ~ April 2015
• 24,382 U.S. high-tech companies (1.4% public; 5.7 years old on average)
• HQ location, CB-defined industry sector, key personnel, M&A, investments, business summary
• Leading states: CA, NY, MA, TX
• Leading industries: software, web, e-commerce, ad, mobile
34
LDA topic model with CrunchBase
35
Example topics: video/music, energy, sports, healthcare
Leading effect on business networks
• Mean business proximity:
• 0.293 (394 M&A pairs)
• 0.224 (129 investment pairs)
• 0.218 (9,792 job mobility pairs)
• 0.068 (random pairs)
36
Analysis on high-tech M&A network
• Objective: examine the relationship between the likelihood of M&A matching and nodal/dyadic characteristics
• Challenge: incorporate the inter-relatedness of M&A deals
• Logit/probit cannot capture dependency between observations
• Model all M&A deals as a graph with an ERGM (p* model)
37
ERGM for M&A network
ERGM (Exponential Random Graph Model):
• Based on random graphs [Erdos and Renyi 1959]
• Probability of realizing a graph = a function of the graph’s statistics [Robins et al. 2007] (sketch below)
• Inter-firm proximity: business, geographic, social, co-investment
• Selective mixing: 50 states, 30 industry sectors
• Degree distribution: node degree, M&A experience
38
[Illustration: ERGM terms: degree, selective mixing, proximity]
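As a hedged illustration of the ERGM idea on this slide (the probability of realizing a graph is a function of its statistics), here is a toy sketch that computes a few statistics s(G) and the unnormalized weight exp(theta · s(G)) for one candidate M&A graph. The firm names, proximity values, chosen statistics, and theta are all made up; the actual coefficients come from the MCMC maximum likelihood estimation described on the next slide, not from this code.

```python
# Toy ERGM-style weight for one graph: exp(theta . s(G)) (unnormalized).
import math
import networkx as nx

G = nx.Graph()                               # hypothetical M&A graph
G.add_nodes_from(["A", "B", "C", "D"])
G.add_edges_from([("A", "B"), ("A", "C")])

# Hypothetical dyadic covariate: business proximity for every firm pair.
proximity = {frozenset(p): v for p, v in [
    (("A", "B"), 0.42), (("A", "C"), 0.31), (("A", "D"), 0.05),
    (("B", "C"), 0.12), (("B", "D"), 0.08), (("C", "D"), 0.02),
]}

def ergm_statistics(g):
    """s(G): edge count, proximity summed over realized edges, and the
    number of firms with 2+ deals (a crude degree-distribution term)."""
    edges = g.number_of_edges()
    prox_sum = sum(proximity[frozenset(e)] for e in g.edges())
    repeat_dealers = sum(1 for _, d in g.degree() if d >= 2)
    return [edges, prox_sum, repeat_dealers]

theta = [-3.0, 2.5, 0.4]                     # made-up coefficients

def ergm_weight(g):
    """Unnormalized ERGM probability of g."""
    return math.exp(sum(t * s for t, s in zip(theta, ergm_statistics(g))))

print(ergm_statistics(G), ergm_weight(G))
```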
Estimation setup
• Dataset
• US companies founded from 2008 to 2012: |V| = 24,382
• All dyadic/nodal attributes collected in April 2013
• M&A transactions (April 2013~April 2015): |E| = 394
• Estimate our ERGM M&A model
• Randomly sample 25% of companies for computational feasibility
• Run 100 Condor jobs with 100 sampled graphs
• Estimate model coefficients by Markov chain Monte Carlo (MCMC) maximum likelihood estimation (MLE)
39
Empirical results: M&A & proximity
40
• Proximities are normalized for comparison
• Effect of a 1.0 std increase in business proximity = effect of a 3.64 std increase in social proximity = effect of a 6.89 std increase in investment proximity
[Regression table with three positive coefficients]
Cloud-based platform design
41
Big Data and Cloud technologies: Cronjob, NoSQL, Python, Scala, Condor, Google Cloud (Storage, App Engine, Datastore) and more
Find competitors
● M&A market is a two-sided platform
o buyers: established companies
o sellers: startups
● We can increase the efficiency of this two-sided market by
o building an interface, VentureMap, to make the data accessible
o recommending matchings with our M&A model
● Potential beneficiaries
o Established firms: intelligence/M&A department
o Startups: identify opportunities, potential buyers
o Venture capitalists
o Market intelligence firms
o Researchers in the finance field
42
Search firms by business components
43
Search firms by business components
44
Domain 4: Cybersecurity Analytics
45
Business | Mobile | Social | Internet Security
How Would Information Disclosure
Influence Organization’s
Outbound Spam Volume:
Evidence from a Field Experiment
Gene Moo Lee (UT Arlington)
Joint work with
Shu He (UConn)
Sukjin Han (UT Austin)
Andrew B. Whinston (UT Austin)
Journal of Cybersecurity
Big Data in Action: Field Experiment
47
Improve Internet security with proactive security evaluation
1. Evaluate security levels with outbound spam
2. Randomization by spam volume, size, and industry
3. Treatments: private info sharing + publicity
http://cloud.spamrankings.net
48
Experiment results
49
Big Data Analytics in 4 domains
50
Business | Mobile | Social | Internet Security
51
Big Data Analytics Tools
• Data Management: API, web scraping; HDFS, NFS; RDBMS, NoSQL; backup/cloud storage; …
• Data Processing: Hadoop, Condor; Cloud Compute Engine; sketching, streaming; …
• Machine Learning: topic models; NLP, sentiment analysis; clustering; deep learning; …
• Econ/Stats: ERGM, SNF; panel analysis; game theory; randomization; …
We are just scratching the surface now
• Big Data is mostly unstructured (80~90%)
• Text: social media, reviews, financial documents
• Photo: social media, product images
• Video: user-generated content, TV commercial
• Virtual reality (VR), augmented reality (AR)
• Now we have Big Data and analytics tools. There is a lot of work to be done!
52
Text Analysis on Annual Reports
• SEC 10-K documents from all US public firms
Thank you!
Contact Info: Gene Moo Lee
gene.lee@uta.edu
Business proximity from topic model
• Business proximity p_b(i, j) between firms i and j
• Cosine similarity of topic vectors T_i and T_j: p_b(i, j) = (T_i · T_j) / (||T_i|| ||T_j||)
• Range: 0 (no commonality) ~ 1 (identical business components)
55
Topic models and topic complexity
• Apply Latent Dirichlet Allocation (LDA) to the text and tags in blog posts [Blei et al. 2003]
1. The whole text collection (bag of words; word order doesn’t matter) is described by topics
2. Each topic is a set of related keywords
3. Each post is represented by a topic vector
• Input: 33,686 blog posts
• Output: d topics, a d-dimensional vector for each post
56
[Figure: LDA results]
Word2vec: Word embedding
57
• Word2vec [Mikolov et al. 2013] represents words in a vector space where semantically similar words are nearby
• Assumption: words that appear in the same contexts share semantics
• Train a model that maximizes prediction of word co-occurrence (3 words before/after the focal word): word order matters here (sketch below)
Source: TensorFlow
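A minimal sketch of training a word2vec model with gensim, using the +/-3-word context window mentioned above. The toy sentences are hypothetical and this is not the authors' exact pipeline.

```python
# Skip-gram word2vec on a toy corpus (window=3 as on the slide).
from gensim.models import Word2Vec

sentences = [                                  # hypothetical post text
    ["new", "spring", "collection", "in", "stores", "now"],
    ["behind", "the", "scenes", "at", "the", "photo", "shoot"],
    ["our", "new", "collection", "hits", "stores", "this", "spring"],
]
model = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=1)
print(model.wv.most_similar("collection", topn=3))
```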
Word2vec: order complexity
59
• Compute 1 - (probability of a given sentence) based on the learned model (sketch below)
• Order complexity measures how (un)predictable a sentence is: higher complexity, less predictable
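A rough sketch of the "1 - (probability of a sentence)" idea using gensim's sentence scoring, which is available for skip-gram models trained with hierarchical softmax. The toy sentences are hypothetical, and treating exp(score) as the sentence probability is an approximation of the measure described above, not the authors' exact implementation.

```python
# Approximate "order complexity" = 1 - P(sentence) under a word2vec model
# (requires sg=1, hs=1 so that gensim's score() is available).
import numpy as np
from gensim.models import Word2Vec

corpus = [                                       # hypothetical sentences
    ["new", "spring", "collection", "in", "stores", "now"],
    ["stores", "now", "in", "collection", "spring", "new"],
]
w2v = Word2Vec(corpus, vector_size=50, window=3, min_count=1,
               sg=1, hs=1, negative=0)

log_probs = w2v.score(corpus, total_sentences=len(corpus))  # log P per sentence
order_complexity = 1.0 - np.exp(log_probs)                  # higher = less predictable
print(order_complexity)
```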
Choosing the number of topics
• Perplexity criterion
• Commonly used to evaluate topic models
• Measures how well the word counts of held-out test documents are matched by the word distributions of the learned topics
• Lower is better (sketch below)
60
Text: 20 topics, Tags: 50 topics, Image-text-tags: 70 topics
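A minimal sketch (toy data, assumed gensim workflow) of choosing the number of topics by held-out perplexity: gensim's log_perplexity returns a per-word likelihood bound, and perplexity is conventionally reported as 2^(-bound), with lower values preferred as on the slide.

```python
# Compare candidate topic counts by held-out perplexity (toy example).
from gensim import corpora, models

train_texts = [["spring", "collection", "store"], ["bible", "hymn", "prayer"],
               ["piano", "classical", "concert"]]
test_texts = [["collection", "store"], ["hymn", "prayer"]]

dictionary = corpora.Dictionary(train_texts)
train = [dictionary.doc2bow(t) for t in train_texts]
test = [dictionary.doc2bow(t) for t in test_texts]

for k in (2, 3, 5):                              # toy candidate topic counts
    lda = models.LdaModel(train, num_topics=k, id2word=dictionary, passes=5)
    bound = lda.log_perplexity(test)             # per-word likelihood bound
    print(k, 2 ** (-bound))                      # perplexity; lower is better
```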
Robustness checks
1. Industry subsample analysis
2. Long- and short-term customer engagement
3. Categorize posts/blogs into ‘utilitarian’ vs ‘hedonic’
4. Examine non-linear effects
61


Editor's Notes

  • #7 We all observe that app marketplaces are very successful. Google Android has about 1.5 million apps, Apple iOS about 1.3 million. Microsoft’s Windows Phone is trying hard to catch up.
  • #8 For app developers, there are three main ways to promote their new apps: mobile display ads, social network ads, and cross promotions. In a cross promotion, you make a deal with established apps and promote your new app inside them; for each install you get, you pay the established app. Installs are incentivized by rewards.
  • #12 This is the design of the app matching market. I will develop a general theoretical framework to serve many-to-one, then many-to-many matching. The engine will incorporate a matching algorithm with nice properties, such as stability and monotonicity. We will apply this design with the company to conduct field experiments.
  • #34 We propose a novel business proximity based on Latent Dirichlet Allocation. Companies’ business descriptions are the input to LDA. Then LDA produces (a) industry-wide topics and (b) topic distribution for each company. Then inter-firm proximity is calculated with cosine similarity between two topic distribution vectors.
  • #35 Introducing the data: we collected data from CrunchBase, which is the “Wikipedia” of the high-tech industry. It has information on 24K U.S. companies, including locations, key people, transactions, and business summaries. California and New York are the hubs. Software, web, and e-commerce are the leading industry sectors.
  • #36 To give you a sense of the constructed topic model, here is the model built from companies’ business descriptions. Let’s take a look at some topics.
  • #37 Here we validate the new business proximity measure by relating it to firm interactions. Specifically, we look at four groups of company pairs: M&A, investment, job mobility, and random pairs.
  • #38 Now here is our research question: what are the underlying driving forces that produced the M&A network? There are three challenges we try to address.
  • #39 Based on the constructed proximity, the next step is to model the M&A network. To incorporate the interdependency of different M&A transactions, we employ the exponential random graph model (ERGM). The idea is that, given a set of nodes, the probability of realizing a specific graph is a function of various graph statistics. Proximity: p = sum of business/social/investment/geographic proximities over all deals. Degree distribution: network density, companies with multiple deals (power law). Selective mixing: 50 states and 30 categories.
  • #40 MZS: one paper said that you cannot sample…. Scaling up ERGM without decomposing the problem remains an open issue; we run R in a distributed system.
  • #41 Here is the model estimation result for the M&A network. For computational feasibility, we ran the estimation on 25% samples 100 times. Here we report the % of samples with the expected signs of the proximity-related parameters. Business: 86%, social: 70%, invest: 51%, geo: 5%. Interpretation of theta: holding everything else equal, if forming a new edge increases the proximity sum by 1, then the log-odds of it forming is theta.
  • #42 Now that we know our proposed proximity is an important factor in M&A network formation, we prototyped a platform based on the business proximity idea. The back end collects data about the high-tech industry and builds topic models to analyze the industry, and the front end provides a cloud-based interface in which people can navigate the industry.
  • #43 In this platform, you can search companies and their competitors based on the business proximity we propose.
  • #44 You may also search companies by selecting topics.
  • #45 Also, you can specify the topics of your interest and find relevant companies that fit your search criteria. We believe that this is like a Google search for the high-tech industry. This may increase M&A market efficiency.