
Disrupting with Data: Lessons from Silicon Valley


Anand Rajaraman's keynote address at VLDB 2016

Published in: Technology

  1. Data-Driven Disruption: Lessons from Silicon Valley • Anand Rajaraman
  2. The Rise of Data Driven Disruption
  3. 50-fold Growth from 2010 to 2020 • 2014: More bits in the digital universe than stars in the physical universe
  4. Sources of Data • The world creates 1.7MB of data per minute per person (Source: The Digital Universe, IDC Report, 2014)
  5. Data-Driven Applications
  6. Talk outline • The evolution of data-driven applications • 5 generations • Lessons and Opportunities • From the intersection of startups, venture capital, and research • Key theme: Disruption vs Optimization • Conclusion
  7. THE EVOLUTION OF DATA-DRIVEN APPS
  8. Follow the Data! • Value-creation has followed the most valuable data sources available! • 5 overlapping generations
  9. Data-driven apps: The First Generation • All about leveraging private, structured data assets for competitive advantage • E.g., Sales, inventory, payroll, …
  10. Data-driven apps: The Second Generation • Harnessing the power of public data
  11. Data-driven apps: The Third Generation • Leveraging the power of “semi-public” Social + Mobile Data • Personal data shared in a frictionless manner with user’s consent
  12. Third Generation Examples
  13. Data-driven apps: The Fourth Generation • Combining public, semi-public, and private data
  14. 4G Example: Paysa • Am I being compensated fairly? • 2012 Stanford CS grad • Java, C++, Ruby, and Machine Learning • Software Eng II at Google
  15. 4G Example: Paysa • Salaries: 35M+ salary datapoints • Companies: 500k+ companies • People: Professional DNA of 15M tech employees • Jobs: Millions of job postings updated daily • Data sources (labeled Private and Public in the diagram): Local/National Government Databases, Partnerships (e.g., Udacity), Recruiters, Companies, Web Crawl, Social Media
  16. The Fifth Generation: Just add AI! • Companies generate massive amounts of training data • New class of proprietary data
  17. The Fifth Generation
  18. Fifth Generation Examples
  19. Summary: Follow the Data!
  20. LESSONS AND OPPORTUNITIES
  21. Lessons and Opportunities 1. Startup and Investment Landscape 2. Disruption vs Optimization 3. Human-Machine Collaboration 4. Rise of the Cyborg 5. The Data is not a Given
  22. Lessons and Opportunities 1. Startup and Investment Landscape 2. Disruption vs Optimization 3. Human-Machine Collaboration 4. Rise of the Cyborg 5. The Data is not a Given
  23. 3 broad categories: Infrastructure, Analytics, Intelligent Applications
  24. Infrastructure • Accessed primarily by developers
  25. Analytics • Data exploration and modeling for data scientists and business people
  26. Vertical Analytics: Cuberon
  27. The “Why?” Question • Why are signups down this week? • Why did this marketing campaign do so well? • Why did this A/B test not perform?
  28. Consumer Behavior Analytics: Cuberon • Build data cube • Identify anomalous subcubes
  29. Intelligent Applications (Source: Matt Turck, Jim Hao & FirstMark Capital)
  30. More Intelligent Applications… (Source: Matt Turck, Jim Hao & FirstMark Capital)
  31. Intelligent App Example: Descartes Labs • Another example: Zillow
  32. Trends and Takeaways • Infrastructure is available and solid • Major transition from Hadoop to Spark • Investment focus on “Vertical” analytics plays • e.g., Cuberon, Ayasdi • The Age of the Intelligent App has dawned • Major opportunities and investment dollars flowing here! • e.g., Troo.ly, Descartes Labs, DocsApp
  33. Lessons and Opportunities 1. Startup and Investment Landscape 2. Disruption vs Optimization 3. Human-Machine Collaboration 4. Rise of the Cyborg 5. The Data is not a Given
  34. Data-driven Optimization (Source: EMC, Understanding Data Lakes)
  35. Data-driven Disruption
  36. Beware the Hippo • HiPPO = Highest Paid Person’s Opinion
  37. Why does disruption happen? • Data scientist as advisor not decision maker • Domain expertise and experience often win out over data • Data-driven approach enables a completely different business model • E.g., A la carte streaming vs fixed number of channels • Cannibalization concerns • Fear of making mistakes • Algorithms can make mistakes • But algorithms can learn and improve much faster with data!
  38. Why does disruption happen? • Classic Innovator’s Dilemma with a turbo-boost: data network effects • Accelerates the pace of disruption
  39. Disruption Example: Venture Capital • Venture Capital has been an established industry for several decades • Process has not changed much since early days • VC firms expect entrepreneurs to approach them with pitches • Some VC firms have tried using data • Data scientists in advisory role • Not partners who make investment decisions • High concentration in Silicon Valley • And a few other places…
  40. Sets the stage for… • rocketship.vc: Venture Investing through Data Science
  41. More Global Startups • Reduced costs to launch a startup • Large consolidating markets; smartphone ubiquity • Emerging Market Opportunities • Untapped talent pools
  42. Beyond Human Scale • 2.1 Million “Startups” • 115K need funding at any time • 90% outside Silicon Valley • 12.8 Million Companies
  43. Why Data-Driven? Geography • [Chart: Number of $B companies by year, 2003 to 2014; y-axis: Count (0 to 100); series: Silicon Valley, Outside Silicon Valley]
  44. The Company Model • Traction • Team • Market • Competition • Customer Feedback
  45. Business Model Innovation • Proactively identify interesting companies and reach out to them at the appropriate moment • [Chart: South America 9%, East Europe 11%, China 13%, India 7%, Other East Asia 11%, Other Europe 5%, Other North America 7%, US SF 11%, US Other 22%, Unknown 4%]
  47. Lessons and Opportunities 1. Startup and Investment Landscape 2. Disruption vs Optimization 3. Human-Machine Collaboration 4. Rise of the Cyborg 5. The Data is not a Given
  48. Current view of Human-Machine Collaboration (Source: 10Clouds Blog)
  49. But what about… • rocketship.vc
  50. Peripheral Vision • To make optimal decisions, humans must provide “peripheral vision” to model • Is this data point an outlier or does it fit the model? • e.g., Geo or category in VC • Is there bias in the model? • e.g., historical racial gap in sentencing and parole decisions • Has the world changed in a way that invalidates the assumption of the model? • e.g., flash crash on Wall Street
  51. The Problem • Must judges, policemen, doctors, bureaucrats understand the nuances of the data and the model? • Even trickier when we consider complex workflows involving multiple decision makers • e.g., a drug trial
  52. The Opportunity • Systems that include humans and models as peers • Can also be complex workflows that involve many humans and models • How best to structure such systems to produce optimal decisions? • Model might need to be tuned to work with specific human • Model Invalidation • Can models know when they are no longer valid?
  53. Is it time to disrupt Mechanical Turk? • The world has changed a lot since Mechanical Turk was introduced in 2005 • Can we move closer to true hybrid human-machine computing? • Harness both human initiative and computing power • Harness sensors in phones • Reimagine problems, tasks and incentives
  54. Lessons and Opportunities 1. Startup and Investment Landscape 2. Disruption vs Optimization 3. Human-Machine Collaboration 4. Rise of the Cyborg 5. The Data is not a Given
  55. Data-driven software all around us…
  56. The Agency Problem • Each model is optimized for the good of the company that owns it • Often our goals and the company’s goals are in alignment but not always!
  57. Problems • Privacy • Everyone has your data and is modeling your actions • Pricing and Discovery disadvantage • You discover only what they choose to show you • You are not a population • Each service models its population of users • And is optimizing for its own ends • Would you rather be explored or exploited?
  58. We have helped create this situation
  59. Wooden weapons against guns and steel (Image: Conquistadors and Incas, painting by John Everett Millais)
  60. Or if you prefer… (Image: South Park)
  61. Enter the Cyborg
  62. Cyborg Layer mediates interactions
  63. Cyborg Layer Services • Privacy protection • e.g., using Differential Privacy techniques • Or by strategically spreading interactions across services • e.g., watch some movies on Netflix and some on Amazon • Discovery and Pricing • Looks at a larger selection and picks items for you • Acts strictly as your agent; no conflict • Combine personal and population models • Cyborg has complete access to all my data • External services have population data, but only limited window
  64. Combining Personal and Population Models
  65. Lessons and Opportunities 1. The Age of the App 2. Disruption vs Optimization 3. Human-Machine Collaboration 4. The Rise of the Cyborg 5. The Data is not a Given
  66. How to build a Model: Conventional View • Use ground truth to build the best model possible • Feature engineering + model selection • Maybe some data cleaning and integration
  67. Example: Troo.ly • 2005: Transactions • 2015: Experiences • Need for online trust has grown dramatically! • Would you rent your house to this stranger?
  68. Troo.ly Problem Statement • What we are given: Known Bad, Known Good, Not Known
  69. Can you trust the ground truth? • Bad users might have a good label if they haven’t engaged in bad activity yet • Labels may be incorrect if they are coming from bad internal models • Labels may be incorrect because of wrong attributions in bad transactions
  70. Rocketship.vc: company data • How to trade off data sources based on Coverage, Accuracy, Depth, Freshness, and Cost? • Which subset of data sources yields the best model? • Which subset of data sources will identify promising companies most quickly? • Promising start • Dong et al, VLDB 2012 • Rekatsinas et al, SIGMOD 2014
  71. Algorithmic Law Enforcement (Source: The Economist, August 20, 2016) • But what about perpetuating bias against minorities?
  72. Summary • Cannot trust the given data completely • Ground truth is often neither true nor grounded • Data may have bias • Look for additional data that can improve model • Quality/cost tradeoff? • Generate your own training data! • E.g., Polarr photo-editing app • Data Programming (Ratner et al, 2016)
  73. CONCLUSION
  74. Summary • 5 generations of data-driven applications • Lessons and Opportunities 1. The Age of the Intelligent App 2. Disruption vs Optimization 3. Human-Machine Collaboration 4. Rise of the Cyborg 5. The Data is not a Given
  75. Identity Crisis? • Data Management • Semantic Web • Machine Learning • Data Mining • Information Retrieval • AI • Systems (Source: Panel at NorCal DB Day, 2016)
  76. Marketing Myopia (Source: Marketing Myopia, Theodore Levitt. HBS Case Study, 1960)
  77. Data impacts every human endeavor • Data at the center of: Entertainment, Transportation, Government, Manufacturing, Sciences, Education, Security, Commerce
  78. Data + X • Core identity of the field is to create value from data • Never a better time for it! • Data is now a key part of every field of human endeavor • Stanford CS+X • The value of being an outsider
  79. Go Forth And Disrupt! • Entertainment, Transportation, Government, Manufacturing, Sciences, Education, Security, Commerce
  80. ANNOUNCEMENT
  81. IIT Madras CS Visiting Chair Program • Focus area: data-driven approaches to tackle important problems • Leading faculty/researchers from around the world welcome! • Flexible time commitment • Minimum 2 weeks • Endowed by Venky Harinarayan and Anand Rajaraman
  82. Confirmed Visiting Chairs so far… • Jeff Ullman, Professor Emeritus, CS, Stanford • Randy Katz, Distinguished Professor, EECS, UC Berkeley • Hari Balakrishnan, Professor, EECS, MIT
  83. For more information • deaniar@iitm.ac.in • Prof. Nagarajan
  84. Thanks! • Anand Rajaraman • datawocky@gmail.com • @anand_raj
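
The Cuberon slide (28) summarizes its approach in two steps: build a data cube, then identify anomalous subcubes. The slides give no implementation details, so the following is only a minimal sketch of that general idea, not Cuberon's method. The function names (`build_cube`, `anomalous_subcubes`), the thresholds, and the sample signup data are all hypothetical, and the anomaly rule (a large week-over-week relative drop) is an assumption chosen to match the "Why are signups down this week?" question on slide 27.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Aggregate `measure` over every subset of `dimensions` (a small data cube)."""
    cube = defaultdict(float)
    for row in rows:
        for k in range(len(dimensions) + 1):
            for dims in combinations(dimensions, k):
                # A subcube is identified by the (dimension, value) pairs that define it.
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row[measure]
    return dict(cube)


def anomalous_subcubes(previous, current, min_drop=0.5, min_volume=10):
    """Flag subcubes whose total fell by at least `min_drop` (a fraction).

    Both thresholds are illustrative assumptions, not values from the talk.
    """
    flagged = []
    for key, before in previous.items():
        after = current.get(key, 0.0)
        if before >= min_volume and (before - after) / before >= min_drop:
            flagged.append((key, before, after))
    # Largest absolute drop first.
    flagged.sort(key=lambda item: item[1] - item[2], reverse=True)
    return flagged


# Hypothetical signup counts for two consecutive weeks.
last_week = [
    {"country": "US", "device": "ios", "signups": 100},
    {"country": "US", "device": "android", "signups": 80},
    {"country": "IN", "device": "ios", "signups": 40},
    {"country": "IN", "device": "android", "signups": 120},
]
this_week = [
    {"country": "US", "device": "ios", "signups": 98},
    {"country": "US", "device": "android", "signups": 82},
    {"country": "IN", "device": "ios", "signups": 41},
    {"country": "IN", "device": "android", "signups": 30},
]

dims = ["country", "device"]
flagged = anomalous_subcubes(
    build_cube(last_week, dims, "signups"),
    build_cube(this_week, dims, "signups"),
)
for key, before, after in flagged:
    print(key, before, after)
```

In this toy data the overall total falls by only about a quarter, so the top-level cell is not flagged; the drop is localized to the India/Android subcube (and, through it, to India as a whole), which is the kind of answer a "Why?" query is after. Enumerating every subset of dimensions grows exponentially with the number of dimensions, so a practical system would need pruning that this sketch omits.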

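
Slide 63 lists Differential Privacy techniques as one way a Cyborg Layer could protect privacy, without saying which technique. One standard building block is the Laplace mechanism for counting queries; the sketch below illustrates it under that assumption. The helper names (`laplace_noise`, `private_count`) and the viewing-history data are hypothetical, and nothing here is taken from the talk itself.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale).

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return an epsilon-differentially-private count of matching records.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical viewing history that the user's agent holds locally.
watched = ["drama", "comedy", "drama", "documentary", "drama"]

rng = random.Random(0)  # seeded only to make the sketch repeatable
noisy = private_count(watched, lambda genre: genre == "drama", epsilon=0.5, rng=rng)
print(f"noisy drama count: {noisy:.2f}")
```

Smaller values of `epsilon` add more noise and give stronger privacy; larger values keep the released count closer to the true value of 3. A real deployment would also need to track the privacy budget across repeated queries, which is out of scope for this sketch.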