Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup


    Presentation Transcript

    • Data Science Consulting or Science meets business, again. Third time a charm? David Johnston ThoughtWorks March 17, 2014
    • Postdocs drive the world’s economy – Young scientists become… Professors
    • ThoughtWorks • Global software consulting company • HQ in Chicago. Major offices in NY, San Fran, Dallas, India, Brazil, Australia, China - over 30 worldwide. • Privately owned by Roy Singham • Flat hierarchy of passionate people
    • Agile Analytics at TW • Practice started in 2011 • Led by Ken Collier and John Spens • About a dozen people involved. Key theme of Ken’s book: • BI, data warehousing and analytics have largely missed the revolution in agile methodologies. We can do it differently. What we do: • Probabilistic modeling • Predictive analytics / machine learning • Advanced BI, prescriptive analysis • Big Data technologies • Advanced algorithms and data structures, streaming
    • Case Studies: Recommendation systems for a retail customer. [Figure: our Bayesian model (blue)] Healthcare group purchasing organization: • The problem is matching medical products by text description (fuzzy matching). • In-place solution: a rules engine. Complicated. 60% match rate, one day required per run. • In 3 weeks we delivered a lightweight solution in Python. >80% match rate, runtime of a few minutes (on a laptop). • Later moved to Elasticsearch for even better results.
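The talk does not say how the Python matcher worked, so the following is only a minimal sketch of fuzzy matching on text descriptions, assuming a standard-library string-similarity approach. The catalog entries, the `normalize` and `best_match` helpers, and the 0.6 threshold are all hypothetical and chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(description: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized descriptions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query: str, catalog: list[str], threshold: float = 0.6):
    """Return (item, score) for the closest catalog entry, or (None, score) below threshold."""
    score, item = max((similarity(query, entry), entry) for entry in catalog)
    return (item, score) if score >= threshold else (None, score)


# Hypothetical catalog, not data from the talk.
catalog = [
    "Syringe 10 mL Luer Lock, sterile",
    "Exam gloves, nitrile, large, box of 100",
    "Gauze sponge 4x4 in, 12-ply",
]

match, score = best_match("nitrile exam glove large 100/box", catalog)
no_match, low_score = best_match("office chair", catalog)
print(match, round(score, 2))
print(no_match, round(low_score, 2))
```

Sorting tokens before comparing is one simple way to make the match insensitive to word order, which matters when suppliers describe the same product with the words shuffled. A scan like this is quadratic in catalog size, so a large catalog would need an index, which is plausibly one reason to move to a search engine later.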
    • What exactly is data science? • Is this really new? - Not really • Does the term “data science” make any sense? - Not really, but so what? • Is it just a fad? Over-hyped? - No; sometimes. • Why did this term become popular just a few years back? - Productivity • Where is this going? • Should scientists/engineers/math-types really go and make a career doing this? - Yes, for most
    • Is it new? Of course not. Combination of many subjects: • Mathematics and statistics – probability theory • Machine learning • Computer science – algorithms, data structures, databases • Operations research – process optimization • Business consulting • Software development. Where have we seen this before? Business: Finance, Insurance, Sports, Government accounting, Retail, Google. Science: Physics, Astronomy, Biology
    • Why now? Data scientist productivity growth crosses critical threshold for new job creation • Salary increase over postdoc requires ~2.5x • Salaries in industry are set by productivity and supply/demand • Crossing the threshold in productivity leads to new job creation • Eventual slowing in productivity and/or changes in supply/demand will end this burst in job creation • Nothing magical happened in 2005!
    • Productivity drivers for data science. Long time scale: • Compute, Moore’s law • The internet (duh!) • HD and RAM price drop • Science learns to deal with Big Data • Growing importance of statistics. More recent: • Git, code sharing • Machine learning libraries • Python/R, open source • Hadoop and ecosystem • The Cloud, AWS • NoSQL databases, in-memory • Growing community in “data science”: cohesion, feedback effects of popularity
    • So what is data science now? My definition of data science: An interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science. Misses only the first one
    • Are we there yet? Overhyped, underhyped, mis-hyped? • No, probably not • Productivity growth is real • We are solving important problems. Plenty left. • Big Data will probably peak in the hype cycle before data science • Just watched my first analytics commercial. IBM. “Math is not a fad” - Aaron Erickson, ThoughtWorks
    • Case study: Particle Physics. Data reduction par excellence • 600 million collisions per second • Most are boring events and are not saved • Save ~100 petabytes per year • Determine existence of the Higgs boson: 1 bit • Measure its mass to 1%: ~1 byte • Data = exabytes • Information = 9 bits • Compression: 10^18 • Goal $9 billion per byte!
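The arithmetic behind the slide above can be checked in a few lines. This is a back-of-the-envelope sketch, not the speaker's calculation: it assumes "exabytes" means exactly one exabyte (10^18 bytes) and reads "mass to 1%" as distinguishing roughly 100 possible values.

```python
import math

# Assumption: the raw data volume is taken as exactly 1 exabyte = 10**18 bytes.
raw_bits = 8 * 10**18

# Existence of the particle is a yes/no answer: 1 bit.
existence_bits = 1

# Assumption: 1% precision ~ 100 distinguishable values, i.e. log2(100) bits,
# which the slide rounds up to one byte.
mass_bits_exact = math.log2(100)
mass_bits = 8

information_bits = existence_bits + mass_bits
compression = raw_bits / information_bits

print(f"bits needed for 1% precision: {mass_bits_exact:.1f}")
print(f"information extracted: {information_bits} bits")
print(f"compression factor: about 10^{round(math.log10(compression))}")
```

Under these assumptions the ratio of raw bits to extracted bits comes out just under 10^18, matching the order of magnitude quoted on the slide.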
    • Data science consulting. The good: • Always something new, always learning. • Exposed to many different people. • Get to see how everything works on the inside. • See the world! • Low career risk but still fun. The bad: • Your clients choose you. • People problems are often more important than math problems. • Travel can be extreme. • Your great ideas will rarely be credited to you.
    • Challenges in data science consulting. Common challenges: • Businesses don’t yet understand the terminology, process or techniques. Much teaching involved • Visionary CEO sends you into a not-so-visionary environment • Problems can be vague • Communication with business stakeholders takes much of your time • We are still developing an effective model. More than just agile techniques. Red flags: • “Build us a platform for analytics so we can become a data-driven company.” Non sequitur • Wanting prediction of the unpredictable • Attempting to use ML on noisy data • When incentives and opinions are all over the map • Convinced that the problem was solved 20 years ago, e.g. linear regression, segmentation model, SAS.
    • Keep offering up bold ideas • Look for ways to achieve major productivity enhancements • Keep up on cutting-edge literature in stats/ML • All my best ideas for web apps are now successful companies. • Everybody laughed at them! Data science is NOT going to be productized. FIN