Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup



  1. Data Science Consulting, or: Science meets business, again. Third time a charm? David Johnston, ThoughtWorks, March 17, 2014
  2. Postdocs drive the world’s economy – Young scientists become… Professors
  3. ThoughtWorks • Global software consulting company • HQ in Chicago. Major offices in New York, San Francisco, Dallas, India, Brazil, Australia, China; over 30 worldwide. • Privately owned by Roy Singham • Flat hierarchy of passionate people
  4. Agile Analytics at TW • Practice started in 2011 • Led by Ken Collier and John Spens • About a dozen people involved. Key theme of Ken’s book: • BI, data warehousing and analytics have largely missed the revolution in agile methodologies. We can do it differently. What we do: • Probabilistic modeling • Predictive analytics / machine learning • Advanced BI, prescriptive analysis • Big Data technologies • Advanced algorithms and data structures, streaming
  5. Case Studies. Recommendation systems for a retail customer: [Figure: our Bayesian model shown in blue]. Healthcare group purchasing organization: • Problem is matching medical products by text description (fuzzy matching). • Solution in place: a complicated rules engine with a 60% match rate, requiring one day per run. • In 3 weeks we delivered a lightweight solution in Python: >80% match rate, runtime of a few minutes (on a laptop). • Later moved to Elasticsearch for even better results.
  6. What exactly is data science? • Is this really new? Not really. • Does the term “data science” make any sense? Not really, but so what? • Is it just a fad? Over-hyped? No; sometimes. • Why did this term just become popular a few years back? Productivity. • Where is this going? • Should scientists/engineers/math-types really go and make a career doing this? Yes, for most.
  7. Is it new? Of course not. Combination of many subjects: • Mathematics and statistics: probability theory • Machine learning • Computer science: algorithms, data structures, databases • Operations research: process optimization • Business consulting • Software development. Where have we seen this before? Business: finance, insurance, sports, government accounting, retail, Google. Science: physics, astronomy, biology.
  8. Why now? Data scientist productivity growth crosses a critical threshold for new job creation • Salary increase over postdoc requires ~2.5x • Salaries in industry are set by productivity and supply/demand • Crossing the threshold in productivity leads to new job creation • Eventual slowing in productivity and/or changes in supply/demand will end this burst in job creation • Nothing magical happened in 2005!
  9. Productivity drivers for data science. Long time scale: • Compute, Moore’s law • The internet (duh!) • HD and RAM price drop • Science learns to deal with Big Data • Growing importance of statistics. More recent: • Git, code sharing • Machine learning libraries • Python/R, open source • Hadoop and ecosystem • The Cloud, AWS • NoSQL databases, in-memory • Growing community in “data science”: cohesion, feedback effects of popularity
  10. So what is data science now? My definition of data science: an interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science. Misses only the first one
  11. Are we there yet? Overhyped, underhyped, mis-hyped? • No, probably not • Productivity growth is real • We are solving important problems. Plenty left. • Big Data will probably peak in the hype cycle before data science • Just watched my first analytics commercial. IBM. “Math is not a fad” - Aaron Erickson, ThoughtWorks
  12. Case study: Particle physics. Data reduction par excellence • 600 million collisions per second • Most are boring events and are not saved • Save ~100 petabytes per year. Goal: • Determine existence of the Higgs boson: 1 bit • Measure its mass to 1%: ~1 byte. Data = exabytes; information = 9 bits; compression ~10^18. $9 billion per byte!
  13. Data science consulting. The good: • Always something new, always learning. • Exposed to many different people. • Get to see how everything works on the inside. • See the world! • Low career risk but still fun. The bad: • Your clients choose you • People problems often more important than math problems • Travel can be extreme • Your great ideas will rarely be credited to you.
  14. Challenges in data science consulting. Common challenges: • Businesses don’t yet understand the terminology, process or techniques. Much teaching involved • Visionary CEO sends you into a not-so-visionary environment • Problems can be vague • Communication with business stakeholders takes much of your time • We are still developing an effective model. More than just agile techniques. Red flags: • “Build us a platform for analytics so we can become a data-driven company” (non sequitur) • Wanting prediction of the unpredictable • Attempting to use ML on noisy data • When incentives and opinions are all over the map • Convinced that the problem was solved 20 years ago. E.g. linear regression, segmentation model, SAS.
  15. Keep offering up bold ideas • Look for ways to achieve major productivity enhancement • Keep up on cutting-edge literature in stats/ML • All my best ideas for web apps are now successful companies. • Everybody laughed at them! Data science is NOT going to be productized. FIN
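
Note on the healthcare case study (the slide on matching medical products by text description): the slides do not show the actual solution, so the following is only a minimal sketch of what lightweight fuzzy matching in Python can look like. It uses the standard library's `difflib`, which is an assumption and not necessarily what was delivered; the function names, the 0.6 threshold and the sample product descriptions are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized descriptions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query: str, catalog: list[str], threshold: float = 0.6):
    """Return (description, score) for the closest catalog entry, or None.

    Assumes a non-empty catalog; the threshold is an illustrative value.
    """
    scored = [(item, similarity(query, item)) for item in catalog]
    item, score = max(scored, key=lambda pair: pair[1])
    return (item, score) if score >= threshold else None


# Hypothetical catalog entries, invented for illustration only.
catalog = [
    "Nitrile exam gloves, large, box of 100",
    "Sterile gauze pad 4x4 in",
    "Syringe 10 mL luer lock",
]

print(best_match("GLOVES EXAM NITRILE LARGE (100/BX)", catalog))
print(best_match("office chair", catalog))
```

Sorting tokens before comparison makes the score insensitive to word order, which is a common source of mismatches between vendor descriptions. A pairwise scan like this scales poorly for large catalogs, which may be one reason to move to a search engine such as Elasticsearch.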
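
Note on the particle physics case study: the compression and cost figures on that slide can be checked with a few lines of arithmetic. This is a rough sketch under two assumptions that are not stated on the slides: that "exabytes" means about one exabyte (10^18 bytes) of raw data, and that the total project cost is about $9 billion, which is the figure implied by "$9 billion per byte".

```python
# Assumption: "Data = exabytes" is taken as one exabyte of raw collision data.
RAW_BYTES = 10**18
RAW_BITS = RAW_BYTES * 8

# From the slide: existence of the Higgs boson (1 bit) plus its mass to 1% (~1 byte).
INFO_BITS = 1 + 8

# Assumption: total project cost, chosen to be consistent with the slide's figure.
TOTAL_COST_USD = 9e9

compression_ratio = RAW_BITS / INFO_BITS
cost_per_byte = TOTAL_COST_USD / (INFO_BITS / 8)

print(f"compression ratio: {compression_ratio:.2e}")
print(f"cost per byte of information: ${cost_per_byte:,.0f}")
```

Under these assumptions the ratio comes out a little below 10^18, and the cost per byte is close to the slide's figure once 9 bits is rounded to about one byte.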