Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup
Data Science Consulting
Science meets business, again.
Third time a charm?
March 17, 2014
Postdocs drive the world's economy –
Young scientists become…
• Global software consulting company
• HQ in Chicago. Major offices in NY, San Fran, Dallas, India, Brazil,
Australia, China - over 30 worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people
Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key Theme of Ken’s book
• BI, data warehousing and analytics have largely
missed the revolution in agile methodologies. We
can do it differently.
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
What we do
Recommendation systems for a
retailer customer. Our Bayesian
Healthcare group purchasing
• Problem is matching medical
products by text description. Fuzzy
• Existing solution: a rules engine.
Complicated. 60% match rate, one
day required per run
• In 3 weeks we delivered a
lightweight solution in Python. >80%
match rate, runtime of a few
minutes (on a laptop).
• Later moved to Elasticsearch for
even better results.
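The slides do not say which matching technique the Python solution used. As a minimal sketch of fuzzy matching on text descriptions, assuming the standard-library difflib module and an invented three-item catalog (the function names normalize and best_match are hypothetical, as is the 0.6 threshold):

```python
import re
from difflib import SequenceMatcher


def normalize(description):
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return " ".join(sorted(tokens))


def best_match(query, catalog, threshold=0.6):
    """Return (entry, score) for the closest catalog entry, or (None, score) below threshold."""
    query_norm = normalize(query)
    best_entry, best_score = None, 0.0
    for entry in catalog:
        # ratio() is a similarity in [0, 1] based on matching character blocks.
        score = SequenceMatcher(None, query_norm, normalize(entry)).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    if best_score < threshold:
        return None, best_score
    return best_entry, best_score


# Hypothetical catalog entries, invented for illustration only.
catalog = [
    "Gloves, Nitrile, Exam, Large, Box of 100",
    "Syringe 10 mL Luer Lock Sterile",
    "Gauze Sponge 4x4 12-ply Sterile",
]

match, score = best_match("nitrile exam gloves large 100 box", catalog)
no_match, low_score = best_match("wheelchair", catalog)
```

Sorting the tokens before comparing makes the score insensitive to word order, which is a common source of mismatches in free-text product descriptions. A search engine such as Elasticsearch replaces the linear scan with an index.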
What exactly is data science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really
but so what?
• Is it just a fad? Over-hyped? – No, sometimes.
• Why did this term just become popular a few years back? -
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this? - Yes, for most
Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics – probability theory
• Machine learning
• Computer science – algorithms, data structures, databases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail,
Science: Physics, Astronomy, Biology
Why now? Data scientist productivity growth
crosses critical threshold for new job creation
• Salary increase over postdoc requires
• Salaries in industry are set by
productivity and supply/demand
• Crossing the threshold in productivity
leads to new job creation
• Eventual slowing in productivity
and/or changes in supply/demand
will end this burst in job creation
• Nothing magical happened in 2005!
Productivity Drivers for Data-
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• HD and RAM price drop
• Science learns to deal with
• Growing importance of
• Git, code sharing
• Machine learning libraries
• Python/R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-mem
• Growing community in “data
science”: cohesion, feedback
effects of popularity
So what is data science now?
My definition of data science:
An interdisciplinary field utilizing statistics, computer
science and the methods of scientific research in
areas outside of science.
Misses only the first one
Are we there yet?
Overhyped, underhyped, mis-
• No, probably not
• Productivity growth is real
• We are solving important
problems. Plenty left.
• Big Data will probably
peak in the hype cycle
before data science
• Just watched my first
analytics commercial. IBM.
“Math is not a fad”
- Aaron Erickson, ThoughtWorks
Case study: Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~ 100 petabytes per year
Determine existence of the Higgs boson – 1 bit
Measure its mass to 1% ~ 1 byte
Data = Exabytes
Information = 9 bits
$9 billion per byte!
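The data-to-information arithmetic above can be checked in a few lines. This is a rough sketch using only the slide's own figures (about 100 petabytes saved per year, 1 bit plus about 1 byte of information) and assuming the decimal definition of a petabyte, which the slide does not specify:

```python
BITS_PER_BYTE = 8
PETABYTE = 10**15  # bytes; decimal definition is an assumption, not from the slide

saved_bytes_per_year = 100 * PETABYTE  # "~100 petabytes per year" from the slide
information_bits = 1 + BITS_PER_BYTE   # 1 bit (existence) + ~1 byte (mass) = 9 bits

saved_bits_per_year = saved_bytes_per_year * BITS_PER_BYTE
reduction_ratio = saved_bits_per_year / information_bits

print(f"Information: {information_bits} bits")
print(f"Reduction ratio for one year of saved data: {reduction_ratio:.1e}")
```

Under these assumptions a single year of saved data is reduced by a factor on the order of 10^17, before counting the collisions that were never saved.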
Data science consulting
• Always something new,
• Exposed to many different
• Get to see how everything
works on the inside.
• See the world!
• Low career risk but still
• Your clients choose you
• People problems often
more important than math
• Travel can be extreme
• Your great ideas will rarely
be credited to you.
Challenges in data science
• Businesses don’t yet
understand the terminology,
process or techniques. Much
• Visionary CEO sends you into
a not-so-visionary environment
• Problems can be vague
• Communication with business
stakeholders takes much of
• We are still developing an
effective model. More than just
• “Build us a platform for analytics
so we can become a data-driven
company.” Non sequitur.
• Wanting prediction of the un-
• Attempting to use ML on noisy
• When incentives and opinions
are all over the map
• Convinced that the problem
was solved 20 years ago.
E.g. linear regression,
segmentation model, SAS.
Common challenges / Red flags
Keep offering up bold
• Look for ways for major
• Keep up on cutting-edge
literature in stats/ML
• All my best ideas for web
apps are now successful
• Everybody laughed at
them!
Data science is NOT going to be