Data Science Consulting
or
Science meets business, again.
Third time a charm?
David Johnston
ThoughtWorks
March 17, 2014
Postdocs drive the world's economy –
Young scientists become…
Professors
ThoughtWorks
• Global software consulting company
• HQ in Chicago. Major offices in NY, San Fran, Dallas, India, Brazil,
Australia, China - over 30 worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people
Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key Theme of Ken’s book
• BI, data warehousing and analytics have largely
missed the revolution in agile methodologies. We
can do it differently.
What we do
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
Case Studies:
Recommendation systems for a
retail customer. Our Bayesian
model (blue)
Healthcare group purchasing
organization
• Problem is matching medical
products by text description. Fuzzy
matching.
• Existing solution: a complicated
rules engine. 60% match rate, one
day required per run
• In 3 weeks we delivered a
lightweight solution in Python. >80%
match rate, runtime of a few
minutes (on a laptop).
• Later moved to Elasticsearch for
even better results.
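The fuzzy-matching idea in this case study can be illustrated with a short sketch. This is not the solution delivered to the client, whose details are not given here; it is a minimal stand-in using Python's standard-library `difflib`, and the catalog entries, the `normalize` and `best_match` helpers, and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and sort tokens so word order matters less."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(sorted(cleaned.split()))


def best_match(query, catalog, threshold=0.6):
    """Return (entry, score) for the most similar description.

    The entry is None when no description reaches the threshold.
    The threshold value is an illustrative assumption, not a tuned number.
    """
    q = normalize(query)
    scored = [(SequenceMatcher(None, q, normalize(c)).ratio(), c) for c in catalog]
    score, entry = max(scored)
    return (entry, score) if score >= threshold else (None, score)


# Hypothetical product descriptions, made up for this example.
catalog = [
    "Syringe 10 mL Luer Lock, sterile",
    "Exam gloves, nitrile, large, box of 100",
    "Gauze sponge 4x4 inch, 12-ply",
]

print(best_match("NITRILE EXAM GLOVES LG 100/BX", catalog))
print(best_match("office chair", catalog))
```

Sorting the tokens before comparing is one cheap way to tolerate reordered words; a search engine with analyzers and scoring would handle abbreviations and scale far better than this pairwise loop.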
What exactly is data
science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really
but so what?
• Is it just a fad? Over-hyped? – No, sometimes.
• Why did this term just become popular a few years back? -
Productivity
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this? - Yes, for most
Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics – probability theory
• Machine learning
• Computer science – algorithms, data structures, databases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail,
Google
Science: Physics, Astronomy, Biology
Why now? : Data scientist productivity growth
crosses critical threshold for new job creation
• Salary increase over postdoc requires
~2.5 x
• Salaries in Industry are set by
productivity and supply/demand
• Crossing the threshold in productivity
leads to new job creation
• A slowdown in productivity growth
and/or changes in supply/demand
will eventually end this burst in job
creation
• Nothing magical happened in 2005!
Productivity Drivers for Data
Science
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• HD and RAM price drop
• Science learns to deal with
Big Data
• Growing importance of
statistics
More recent
• Git, code sharing
• Machine learning libraries
• Python / R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-memory
• Growing “data science”
community: cohesion, feedback
effects of popularity
So what is data science now
My definition of data science:
An interdisciplinary field utilizing statistics, computer
science and the methods of scientific research in
areas outside of science.
Misses only the first one
Are we there yet?
Overhyped, underhyped, mis-hyped?
• No, probably not
• Productivity growth is real
• We are solving important
problems. Plenty left.
• Big Data will probably
peak in the hype cycle
before data science
• Just watched my first
analytics commercial. IBM.
“Math is not a fad”
- Aaron Erickson, ThoughtWorks
Case study: Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~ 100 petabytes per year
Determine existence of the Higgs boson – 1 bit
Measure its mass to 1% ~ 1 byte
Data = Exabytes
Information = 9 bits
Compression 10^18
Goal
$9 billion per byte!
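The compression figure on this slide can be checked with rough arithmetic. The sketch below assumes "Exabytes" means about one exabyte (10^18 bytes) and takes the information content as the slide's 1 bit plus 1 byte; both are order-of-magnitude assumptions.

```python
import math

# Assumption: "Data = Exabytes" is taken as roughly one exabyte.
data_bits = 8 * 10**18          # 1 exabyte = 10^18 bytes = 8 * 10^18 bits

# From the slide: 1 bit (does the Higgs exist?) + 1 byte (mass to ~1%).
information_bits = 1 + 8

# Resolving 1 part in 100 needs log2(100) bits, which fits within one byte.
bits_for_one_percent = math.log2(100)

compression = data_bits / information_bits
order_of_magnitude = round(math.log10(compression))

print(f"bits to resolve 1%: {bits_for_one_percent:.1f}")
print(f"compression factor ~ 10^{order_of_magnitude}")
```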
Data science consulting
The good
• Always something new,
always learning.
• Exposed to many different
people.
• Get to see how everything
works on the inside.
• See the world!
• Low career risk but still
fun.
The bad
• Your clients choose you
• People problems often
more important than math
problems
• Travel can be extreme
• Your great ideas will rarely
be credited to you.
Challenges in data science
consulting
Common challenges
• Businesses don’t yet
understand the terminology,
process or techniques. Much
teaching involved
• Visionary CEO sends you into
a not-so-visionary environment
• Problems can be vague
• Communication with business
stakeholders takes much of
your time
• We are still developing an
effective model. More than just
agile techniques
Red flags
• “Build us a platform for analytics
so we can become a data-driven
company” Non sequitur
• Wanting prediction of the
unpredictable
• Attempting to use ML on noisy
data
• When incentives and opinions
are all over the map
• Convinced that the problem
was solved 20 years ago.
E.g. linear regression,
segmentation model, SAS.
Keep offering up bold
ideas
• Look for ways for major
productivity enhancement
• Keep up on cutting-edge
literature in stats/ML
• All my best ideas for web
apps are now successful
companies.
• Everybody laughed at
them!
Data science is NOT going to be
productized.
FIN

Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup
