Data Scientists and Analysts Are Also Software Engineers

by William Whipple Neely
Director of Data Science at Electronic Arts

Data scientists and analysts write code, sometimes a lot of code, so we are software developers as much as we are model builders and algorithm creators. This talk is about the challenges a team of data scientists and analysts faces when trying to scale its work and make that work repeatable and testable. I’ll talk about how our data science team is leveling up its skills as software developers, the challenges we’ve faced, and the strategies that are helping.

Published in: Technology
  1. DATA SCIENTISTS AND ANALYSTS ARE ALSO SOFTWARE ENGINEERS W. Whipple Neely, Director of Data Science, EA
  2. THIS TALK IS ABOUT … Moving data science and analytics teams to a software development model. • The motivation is to create repeatable, verifiable processes. • It also means that we can bring powerful but “personal” analysis environments (such as R) into producing enterprise-level systems, to create work that typical dashboarding systems cannot achieve. • In many ways this is a story about one set of teams; it may not apply to all groups, but it has helped ours.
  3. THE TYPICAL VENN DIAGRAM: WHO IS A DATA SCIENTIST Circles: Statistics; Some Version of Domain Expertise; Computer Science (“hacker skills”). Intersection: Data Science. “What kind of person does all this? What abilities make a data scientist successful? Think of him or her as a hybrid of data hacker, analyst, communicator, and trusted adviser.” Davenport and Patil, Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review, 2012. “Hacker skills” is the wrong term.
  4. GOOGLE IMAGE SEARCH: “WHO DATA SCIENTIST VENN DIAGRAM”
  5. WHAT WE DO INSTEAD OF WHO WE ARE Three overlapping areas with Data Science at the center. • Engineering: data engineering, coding discipline, software engineering, style guides, reproducibility, source code control, regression tests. • Science: math, stats, computer science, machine learning, probability models, economics, “substantive domain expertise”, vast quantities of common sense. • Collaboration: rules of engagement, empathy, communication and listening skills, flexibility, reliability, extreme social skills.
  6. THE PROBLEMS We have a team of data scientists who are experts at probability modeling and machine learning, and a few of them are pretty good at programming in R, Matlab or Python on a laptop. However … 1. Most have no experience of team programming. 2. Many come without experience of creating software that others can use, or that is robust enough to run unattended. 3. Creating an enterprise-level repeatable process can’t be left to the kind of programming that most of us do on our laptops. 4. There is no easy intermediate step between working on a laptop and something that works on the enterprise platform.
  7. WHERE WE STARTED Write R or Python Script → Run Script → Manually Update Report OR Write R or Python Script → Run Script → Manually Update a Static Model Implementation
  8. THE PROBLEMS WITH WHERE WE STARTED • Code/methods/models got lost. • Lots of manual work. • No automated checks for correctness or robustness of models or predictions.
  9. WE TALKED TO THE TEAMS ABOUT WHAT WAS WRONG “Our analysts are pretty good at writing scripts and generating reports, but our team needs help with the bookends: scheduling tasks and serving the reports automatically.” – Colleen Chrisco, Director of Analytics, PopCap Games
  10. IN TERMS OF OUR DIAGRAM Three overlapping areas with Data Science at the center. • Engineering: data engineering, coding discipline, software engineering, style guides, reproducibility, source code control, regression tests. • Science: math, stats, computer science, machine learning, probability models, economics, “substantive domain expertise”, vast quantities of common sense. • Collaboration: rules of engagement, empathy, communication and listening skills, flexibility, reliability, extreme social skills.
  11. THIS WAS A LITTLE SCARY FOR SOME OF OUR TEAMS … “We’re not programmers.” “I don’t even know where to start.” “I’ve never scheduled a job before.”
  12. SO, TO ANSWER THESE CONCERNS WE DID THE FOLLOWING… A workflow built on Perforce and an R server: 1. Check in code (P4V, R-Checkin). 2. Submit job (schedule file, API, web). 3. Run script (reporting, models, ETLs, forecasting). Script inputs: csv, DBs, URL, logs, RDS. Script outputs: csv, DBs, email, doc, pdf, html, shiny, RDS. By “we did the following” I really mean that we hired a brilliant computer scientist named Ben Weber who became part of the team. Ben learned the workflows of the team members and created this system for us.
  13. WHERE IT LANDED US • We’d automated. • We’d gotten the “bookends” covered. • Many analytics teams, including the data science team, are using the system. As a result … • Teams started using the technology to improve their work. • Teams became more efficient: “I no longer have to be a walking dashboard.” • Astonishingly, these teams now have their routine code in source control.
  14. BUT IT DIDN’T SOLVE EVERYTHING • We had produced more tools and simplified tasks, but hadn’t really created the culture of a software-producing organization. • We had extended the laptop model … a little, by introducing VMs that could run the code. And giving teams more tools had introduced some issues … • A proliferation of models/predictions being run without curating the processes. • People leave, and their work continues to be run automatically … This is not always a bad thing, but it is often not a good thing either.
  15. WHAT WE KNEW WE HAD TO DO NEXT We needed to make a cultural change from what is essentially “hacking” to engineering. • So, we did start hiring people with more software engineering skills. • Introduced a style guide for our R code. • We started code and project reviews. • Hired a very non-technical writer to start helping the team produce documentation on our internal Confluence site. • Start providing training in team programming, engineering, new languages (Spark, Python). • Assign some of the positions on the team to be the software/coding gurus.
  16. WHAT’S NEXT • Dev/Test/Prod environments. • Upgrading our toolset to work with RStudio Server and Git. • Pair programming: a team member whose primary background is software engineering programming together with a data scientist who has focused on statistical modeling and machine learning.
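
Slide 8 lists “no automated checks for correctness or robustness of models or predictions” as a problem, and the diagram slides name regression tests as part of the engineering skill set. As a rough illustration only (this is not the system described in the talk), a prediction regression check might look like the Python sketch below. The names `check_predictions` and `toy_model`, the baseline values, and the tolerance are all hypothetical.

```python
import math


def check_predictions(current, baseline, rel_tol=1e-6):
    """Compare current predictions with a stored baseline.

    Returns a list of (key, baseline_value, current_value) tuples for
    every baseline key that is missing or has drifted beyond rel_tol.
    """
    problems = []
    for key, expected in baseline.items():
        if key not in current:
            problems.append((key, expected, None))
        elif not math.isclose(current[key], expected, rel_tol=rel_tol):
            problems.append((key, expected, current[key]))
    return problems


def toy_model(x):
    # Hypothetical stand-in for a real forecasting model.
    return 2.0 * x + 1.0


# Baseline predictions saved from a previous, trusted run (made-up values).
baseline = {1: 3.0, 2: 5.0, 3: 7.0}

# Predictions from the code as it stands today.
current = {x: toy_model(x) for x in baseline}

mismatches = check_predictions(current, baseline)
if mismatches:
    raise SystemExit(f"Prediction regression detected: {mismatches}")
```

Run as a scheduled step before a report or model is published, a check like this turns a silent change in predictions into a visible failure.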

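Slide 14 notes that scheduled work can keep running after its author has left, and that models proliferate without curation. One possible way to surface such jobs, sketched here under assumptions, is a periodic audit of the job list. The `ScheduledJob` record, its fields, the 180-day review window, and the example data are hypothetical and do not describe the actual scheduler.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ScheduledJob:
    # Hypothetical record for one automated job.
    name: str
    owner: str
    last_reviewed: date


def jobs_needing_review(jobs, active_staff, today, max_age_days=180):
    """Return (job name, reasons) for jobs with no active owner or a stale review."""
    flagged = []
    for job in jobs:
        reasons = []
        if job.owner not in active_staff:
            reasons.append("owner no longer on team")
        if (today - job.last_reviewed).days > max_age_days:
            reasons.append("review overdue")
        if reasons:
            flagged.append((job.name, reasons))
    return flagged


# Made-up example data.
jobs = [
    ScheduledJob("weekly_forecast", "alice", date(2016, 1, 10)),
    ScheduledJob("churn_model", "bob", date(2015, 3, 1)),
]
flagged = jobs_needing_review(jobs, active_staff={"alice"}, today=date(2016, 3, 1))
for name, reasons in flagged:
    print(f"{name}: {', '.join(reasons)}")
```

The audit only reports; deciding whether a flagged job should be reassigned, reviewed, or retired stays with the team.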