Data Science Consulting
or
Science meets business, again.
Third time a charm?
David Johnston
ThoughtWorks
March 17, 2014
Young scientists
become…
Professors
Talk Overview
• Agile Analytics group at ThoughtWorks
• What is data science anyway? Origins and future.
Good or evil?
• Guide to technologies and limits to technology
• Process and methodology for successful data
science consulting
ThoughtWorks
• Global software consulting company
• HQ in Chicago. Major offices in NY, San Fran,
Dallas, India, Brazil, Australia, China - over 30
worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people
The three pillars
Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key Themes
• BI, data warehousing and analytics have largely
missed the revolution in agile methodologies.
• We can do analytics in an agile, fast, light-footprint
way.
What do we do?
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
Our main goals
• Use data analysis to give companies an edge in their marketplace
• Use data analysis to improve the world at large
Some typical projects
• Recommender systems
• Customer behavior analysis
• Optimization
• Efficient algorithms/tech for massive data sets
• Company specific analytics challenges
Case Study 1: HealthCare Group
Purchasing Organization
• One of the largest GPOs. 1000s of client hospitals
• Hospitals sign up, pay a fee and get
group-purchasing discounts
• The GPO has to give hospitals estimates of
their likely savings.
• A hospital’s data is usually in a non-standard
spreadsheet. No SKUs in healthcare (yet).
• A data matching mess
Case Study 1: HealthCare Group
Purchasing Organization
GPO: Johnson & Johnson Sterile Scalpel #F8-505
Hospital: J&J scalpel, steel item f8505 size 3’’
• Their in-place solution – Oracle, lots of ETL tools,
using SQL with lots of rigid rules for how to match.
• Database of matching rules was difficult to maintain
• Accuracy of matching ~60%. The rest was done by hand.
Processing took 1 day, plus weeks for the lines done by
hand.
Case Study 1: HealthCare Group
Purchasing Organization
What we did
• First convinced them that their solution was highly
inefficient.
• Wrote a Python program using a tree data structure and
machine learning to do the matching.
• Ran on my laptop in a few minutes. Match rates > 80%
• This was done in 3 weeks. Later settled on a solution using
Elasticsearch.
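To make the matching problem concrete, here is a minimal sketch of token-based fuzzy matching applied to the two example descriptions from the previous slide. It is not the tree-plus-machine-learning program described above, only a simpler stand-in technique (Jaccard similarity over normalized tokens). The alias table, the function names and the second catalog entry are all assumptions made up for illustration.

```python
import re

# Hypothetical abbreviation table; in practice this would be curated or learned.
ALIASES = {"j&j": "johnson & johnson"}


def tokenize(description):
    """Normalize a free-text item description into a set of lowercase tokens."""
    text = description.lower()
    for short, full in ALIASES.items():
        text = text.replace(short, full)
    # Drop hyphens inside words so "f8-505" and "f8505" compare equal.
    text = re.sub(r"(?<=[a-z0-9])-(?=[a-z0-9])", "", text)
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two descriptions (0.0 to 1.0)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(hospital_line, catalog):
    """Return the (catalog item, score) pair with the highest similarity."""
    scored = ((item, jaccard(hospital_line, item)) for item in catalog)
    return max(scored, key=lambda pair: pair[1])


CATALOG = [
    "Johnson & Johnson Sterile Scalpel #F8-505",  # from the slide
    "Acme Sterile Gauze Pad #G2-110",             # made-up distractor entry
]
HOSPITAL_LINE = "J&J scalpel, steel item f8505 size 3''"

item, score = best_match(HOSPITAL_LINE, CATALOG)
print(item, round(score, 3))
```

A rule-free score like this tolerates reordered words and reformatted part numbers, which is where rigid SQL matching rules tend to break. A production matcher would presumably weight part numbers more heavily than generic words.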
Case Study 2: Retail Rec Systems
• Our customer provides
coupons to retailers’
customers
• Needed a better
recommendation system
• We’re using a simple
logistic regression model
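As an illustration of how simple such a model can be, the sketch below fits a one-feature logistic regression by gradient descent using only the standard library. The feature (past purchases in a category), the labels (coupon redeemed or not) and the toy data are assumptions invented for this example, not details of the actual project.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. the score
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy data (made up): past purchases in the coupon's category, and whether
# the customer redeemed the coupon.
past_purchases = [0, 1, 2, 3, 4, 5]
redeemed = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(past_purchases, redeemed)
for x in past_purchases:
    print(x, round(sigmoid(w * x + b), 2))
```

Ranking coupons by predicted redemption probability is then a one-line sort, which is part of the appeal of starting with a model this simple.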
What exactly is data
science?
• Is this really new?
• Does the term “data science” make any sense?
• Is it just a fad? Over-hyped?
• Why did this term just become popular a few years back?
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this?
What exactly is data
science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really
but so what?
• Is it just a fad? Over-hyped? – No; sometimes.
• Why did this term just become popular a few years back? -
Productivity
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this? Yes for most
Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics – probability theory
• Machine learning
• Computer science – algorithms, data structures, databases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail,
Google
Science: Physics, Astronomy, Biology
Isn’t there anything new?
Of course
• Analytics finally becoming ubiquitous in business (as it always should have been)
• Much more communication between disparate fields
• It’s finally work that’s fun
Ok, but why now?
It’s a big movement, so let’s give it a new name: Data Science
Why now? - Productivity
• There has always been plenty of data science in
science
• Job prospects in academia are slim
• Productivity has been rising much faster than
postdoc salaries and scientist job creation
Data scientist productivity
growth
• Salary increase over postdoc requires
~2.5x
• Salaries in industry are set by
productivity and supply/demand
• Crossing the threshold in productivity
leads to new job creation
• A slowdown in productivity growth
and/or changes in supply/demand
will eventually end this burst in job
creation
• Nothing magical happened in 2005!
Productivity Drivers for Data Science
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• HD and RAM price drop
• Science learns to deal with
Big Data
• Growing importance of
statistics
More recent
• Git, code sharing
• Machine learning libraries
• Python/R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-memory
• Growing “data science”
community: cohesion, feedback
effects of popularity
Then and now
1990s data science
• Writing code in C/C++
• Working with flat files
• Even relational/SQL is
new
• Using proprietary software:
Matlab, IDL
• Writing all algorithms from
scratch. Slow. Buggy.
Data science today
• Working in high-level open-source
languages: Python, R
• We’re good at SQL and
have lots of other options
(NoSQL)
• Git, thousands of libraries
available. Easy to install.
• Can concentrate more on
what we’re good at.
So what is data science now?
Data Science:
An interdisciplinary field utilizing statistics, computer
science and the methods of scientific research in
areas outside of science.
Where is it going?
• Big Data technology is separated from data science
• Software developers take over many Big Data roles
• Businesses begin to understand data science terminology as
they now understand software terminology, and that they are not
Twitter.
• Data scientists and businesses find a methodology that works,
as industrial-scale software development has
Where is it going?
Specialization
• Most experienced data scientists move into consulting or
management of teams
• Universities graduate many “data scientist-lite” students from
new more specialized BS or MA programs
• Fewer generalists
• PhD students need to learn additional skills. Not instant hires
(http://bit.ly/1m3krq6)
Why won’t we have 100x
more data scientists in N
years?
• Pool of disgruntled postdocs will dry up or “I am
not even supposed to be here!”
• Many data science problems don’t need the most
cutting edge tools. (Some do).
• People rarely get much experience working with
real data in academic settings. Requires real-
world experience, takes time.
Are we there yet?
Overhyped, underhyped, mis-
hyped?
• No, probably not
• Productivity growth is real
• We are solving important
problems. Plenty left.
• Big Data will probably
peak in the hype cycle
before data science
• Just watched my first
analytics commercial. IBM.
Why Big Data enthusiasm
might peak soon
Big Data defined – Performing calculations on data
when:
• They cannot possibly be done on a single machine
• Sampling and streaming are not effective
• Data reduction is not possible
• Storage and compute are closely balanced
• Parallelizing is absolutely unavoidable
Most tasks are not like this
• Sampling is usually good enough for training machine learning
• Need for rapid feedback, interactive work
• CPUs are underutilized. IO limited.
• Usually a better algorithm can solve the problem better
Hadoop (Spark)
Good use cases
• Large batch jobs like:
restructuring and reducing
data from raw files.
• Scoring with ML models
• When you have to do
something on every data
point.
• Raw storage in HDFS
Bad use cases
• Model development
• Visualization
• Brute-forcing an inefficient
algorithm.
• Treating Hadoop like a
database.
The data-sizes we typically
see
Most companies have a few million customers: ~10^7
Often they store ~1000 items per customer
That’s 10^10 data points. At ~50 bytes/data-point that is 500 GB, or a few TB. Fits on
our laptops (but not in memory). Such data can be moved to the cloud if
need be in 1-2 days.
Often we can be productive with either a sample or an aggregation.
True when
• Customer specific items are things like purchases, manually entered
text, logins etc.
Not true when
• Things are web-events, pair-wise interactions (e.g. graphs, social)
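The back-of-envelope sizing on this slide can be checked in a few lines. The 500 GB total corresponds to roughly 50 bytes per data point across 10^10 points. Both that per-point size and the 40 Mbit/s uplink used for the transfer estimate are assumptions chosen for illustration.

```python
customers = 10**7             # "a few million customers", rounded up
items_per_customer = 1000
bytes_per_point = 50          # assumption: average stored size per data point
uplink_bits_per_s = 40e6      # assumption: a 40 Mbit/s office uplink

data_points = customers * items_per_customer
total_bytes = data_points * bytes_per_point
total_gb = total_bytes / 1e9

# Time to push the whole data set to the cloud over the assumed uplink.
transfer_days = total_bytes * 8 / uplink_bits_per_s / 86400

print(f"{data_points:.0e} data points, {total_gb:.0f} GB, "
      f"{transfer_days:.1f} days to upload")
```

Under these assumptions the data set fits on a laptop disk and uploads in a little over a day, consistent with the 1-2 day figure above.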
Sources of really big data
Sensor data
• Pictures
• Video
• Health monitoring devices
• Internal device monitors
• Results of combinatorial
complexity
However
• Is it really economical to
store and process these
huge data sets to begin
with?
• We will learn to utilize
streaming algorithms
• We will learn to focus on
information, not noise
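One classic example of a streaming algorithm is reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length without storing the stream. The sketch below is the standard "Algorithm R". The function name, the `rng` parameter and the simulated sensor stream are choices made for this illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of any length.

    Uses O(k) memory: each item is seen once and then kept or discarded.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Simulated sensor readings: a generator, so the full stream is never in memory.
readings = (n * 0.5 for n in range(1_000_000))
sample = reservoir_sample(readings, k=5, rng=random.Random(42))
print(sample)
```

If the stream has fewer than `k` items, the function simply returns all of them.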
Case study : Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~ 100 petabytes per year
Determine existence of Higgs boson – 1 bit
Measure its mass to 1% ~ 1 byte
Data = Exabytes
Information = 9 bits
Compression 10^18
Goal
$9 billion per byte!
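The compression figure follows directly from the numbers on this slide, as the short check below shows. It takes "exabytes" to mean 10^18 bytes and counts the result as 1 bit for existence plus 8 bits for the mass. Both readings are assumptions about how the slide's round numbers were meant.

```python
import math

raw_bytes = 10**18            # assumption: "exabytes" taken as 1 EB of raw data
info_bits = 1 + 8             # existence (1 bit) + mass to ~1% (~1 byte)

# Ratio of raw data to extracted information, both measured in bits.
compression_ratio = raw_bytes * 8 / info_bits
order_of_magnitude = round(math.log10(compression_ratio))

print(f"compression ~ 10^{order_of_magnitude}")
```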
Data science consulting
The good
• Always something new,
always learning.
• Exposed to many different
people.
• Get to see how everything
works on the inside.
• See the world!
• Low career risk but still
fun.
The bad
• Your clients choose you
• People problems often
more important than math
problems
• Travel can be extreme
• Your great ideas will rarely
be credited to you.
Challenges in data science
consulting
• Businesses don’t yet understand the terminology,
process or techniques. Much teaching involved
• A visionary CEO sends you into a not-so-visionary
environment
• Problems can be vague
• Communication with business stakeholders takes
much of your time
• We are still developing an effective model. More than
just agile techniques
Red flags to avoid
• “Build us a platform for analytics so we can
become a data-driven company” Non-sequitur
• Wanting prediction of the unpredictable
• Attempting to use ML on noisy data
• When incentives and opinions are all over the
map
• Clients convinced that the problem was solved 20
years ago. E.g. linear regression, segmentation
model, SAS.
Keep offering up bold
ideas
• Look for ways to achieve major
productivity enhancements
• Keep up on cutting-edge
literature in stats/ML
• All my best ideas for web-
apps are now successful
companies.
• Everybody laughed at
them!
Data science is NOT going to be
productized.
FIN

NYC Open Data Meetup-- Thoughtworks chief data scientist talk
