A Hybrid Approach to Data Science Project Management

A talk about how Civis Analytics, a data science consultancy and software company, does project management using a blend of approaches from academia, consulting, and software engineering.

A Hybrid Approach to Data Science Project Management

  1. A Hybrid Approach to Data Science Project Management
     Elaine Lee (@elaineklee)
     Open Data Science Conference
     Building a Data-Driven World™
  2. A Common Problem With Many Faces (Open Data Science Conference, #ODSC)
     Organizations want to be data-driven but many obstacles stand in their way:
     • Communication not trickling up to executives and key decision makers
     • Silos between departments, making it difficult to share and collaborate on analysis
     • Data ingestion (ETL, or Extract-Transform-Load) is difficult and time-consuming
     • Lack of meaningful, yet customizable visual reporting
     • Inability to flexibly scale up or down technological needs at a reasonable cost
     • Inadequate or overwhelming learning resources about data science
  3. Mapping the Uninsured in America
     Where should Enroll America direct its insurance signup efforts?
  4. We ran the first individualized presidential campaign
     As a company, Civis traces its origins to the 2012 Obama for America analytics team. We built a scientific understanding of each voter. Our data science influenced every strategy and tactic: voter targeting, messaging, media buys, and fundraising. This meant the campaign could allocate resources where impact would be greatest.
     Civis Analytics | Proprietary and Confidential
  5. Today, we leverage data science to help our clients in politics, non-profits, and the corporate world.
  6. Introducing Civis
     An easy-to-use, end-to-end, incredibly extendable data science platform in the cloud for teams who want to make great data-driven decisions to drive their organizations forward.
  7. The Civis Approach
     Diagram labels: Product, Consulting, R&D
     • Applied Data Science: tackles the toughest data science problems we can find
     • Data Science R&D: generalizes and automates the solution for many scenarios
     • Software Engineering: integrates solutions into user-empowering software
     • Highly collaborative departments
     • All departments contribute to both our services arm and product development
  8. The Civis Approach
     Our unique team structure allows us to solve your biggest problems with custom solutions and the technology to scale them.
  9. Data Science R&D: Strategies and philosophies
     R&D teams (org chart): Modeling Methodology, Unstructured Data, Engineering
     • Teams based on Civis’s product and consulting needs
     • “Built around code”
     • Semi-annual departmental day-long off-sites to plan upcoming R&D initiatives
     • Academia-influenced: evidence-based approaches to finding and reporting best solutions
     • Software development-influenced: standups, code review
     • Favorite tools: [tool logos]
  10. Data Science R&D: Tools
      • Share and discuss data science news
      • Receive feedback from colleagues using our tools
      • Discuss implementation
      • Lower communication costs compared to email
  11. Data Science R&D: Tools
      • Prototype new workflows
      • Used like a log book to record and present results
      • Share preliminary results with members of other departments
  12. Data Science R&D: Tools
      • Department heads set milestones, check progress, and make project staffing decisions
      • Collaboratively plan development on new functionality or organizational processes (e.g. recruiting)
  13. R&D <-> ADS
      Tools: [tool logos]
      Strategies:
      • Designate “tag team” on R&D as default R&D resources for client engagements
      • This is the Modeling Methodology team
      • Other R&D teams’ members may be staffed on engagements depending on expertise required
      • R&D team member always serves as the Consulted in the RACI model
      • Transparency about challenges is paramount
  14. R&D <-> ADS: A Case Study (Mapping the Uninsured in America)
      1. Assemble a project team of R&D data scientists and Applied Data Scientists.
      2. Work with Enroll America to refine requirements and come up with a plan of analysis, ultimately resulting in the design and execution of a phone survey on a sample of individuals, followed by building a predictive model for the rest of the country.
      3. The Applied Data Science Manager has weekly calls with Enroll America and status meetings with the project team.
      4. The project team delivers the predictions and analysis to Enroll America.
      The project team completes a postmortem and determines these activities could be automated: model building
  15. R&D <-> Tech
      Tools: [tool logos]
      Strategies:
      • Designate teams at the interface to triage issues and plan new development:
        • R&D: “Engineering” team
        • Tech: “Modeling” team
      • Use module or project-specific chatrooms to get answers to ad-hoc questions quickly
      • Identify opportunities to form cross-functional teams, e.g.:
        • Developing apps using the Platform’s API
        • Knowledge sharing on best practices
  16. R&D <-> Tech: A Case Study (Mapping the Uninsured in America)
      1. After the postmortem for the Enroll America engagement, R&D begins prototyping automated modeling functionality and discussing its implementation with the Tech department.
      2. R&D’s Engineering team finishes the prototype and works with Tech’s Modeling team to integrate it as a new feature in the Platform.
      3. During integration, ad hoc discussions occur on GitHub and HipChat to address usability questions, e.g. resource usage and input/output specifications.
      The integration team successfully builds and integrates the Build Model module in the Platform.
  17. Conclusion
      Our approach to data science consulting and product development is enriched by the valuable perspectives of our employees, who come from a wide array of backgrounds, making our project management strategies a hybrid of more conventional techniques.

Editor's Notes

  • Hi everyone, it’s great to be here. My name is Elaine Lee.

    I am a Data Scientist in the R&D department at Civis Analytics.

    Civis is a Chicago-based data science consulting and software startup, and I’m excited to tell you a little bit about our company and the work that we do.

    In particular, I’ll be talking about how the R&D department juggles concurrent development of both our consulting services and our cloud-based data science platform.

    I’ll be emphasizing approaches borrowed from other, more established industries as they pertain to department projects as well as interdepartmental collaborations.
  • Many of you are already familiar with data science and the potential it has to change the way things are done.

    However, data science has a high barrier to entry for some teams, from both a technical and an organizational standpoint.

    It can be difficult to wrap your head around the technical needs and quantitative concepts that go into data science.

    In addition, it can be hard to assemble the right team to do data science and to keep the work organized.

    Picture a team of data scientists working on the same project. Some of them have written R or Python scripts to process the data, do feature engineering, and build models on it.

    Some of them have taken the results of the models and produced charts and visualizations in Excel, Tableau, or D3.

    All the work is being kept in a few different places: Dropbox, Google Drive, GitHub, MySQL, …

    It is difficult for this hypothetical team to figure out what exactly has been done, and even worse, what efforts have been duplicated.

    It is also incredibly difficult to validate the analysis.

    Does this sound familiar to anyone?

    Many of us at Civis Analytics have faced these challenges in our previous work. Fortunately, we’ve since made them a thing of the past!

    It didn’t happen overnight, but we were constantly coming up with new ideas to improve the data science workflow by, well, working on a variety of consulting projects and researching new methods.

    Today I will talk about what some of these ideas are.

    In addition, I will tell the story of how one client engagement provided us a valuable exercise in collaboration and data science best practices we’ve internalized.
  • Throughout my talk today, I will be using our project with Enroll America to illustrate a lot of my concepts.

    Enroll America was one of our first clients in 2013.

    They wanted our help identifying Americans without health insurance so they knew where to direct their outreach.

    This was a challenging problem because of its large scope (they wanted to do outreach throughout the country!) and because it wasn’t obvious what is predictive of being uninsured.

    Why did Enroll America specifically seek us out to solve this problem?
  • Let’s talk a little about what expertise Civis has for tackling problems like Enroll America’s.

    The founding members of Civis Analytics were part of the Obama for America analytics team during the 2012 re-election campaign.

    There, we developed the beginnings of a framework for doing person-level analytics (which is highly relevant for Enroll America).

    With scientific levels of rigor, we built models to understand all sorts of relevant vote-related behaviors in order to better identify and persuade supporters, which translated to optimizing how the campaign’s resources were used.

    The campaign spanned many months and during that time, lots of models were being built and refined; their results were constantly being sent to those in the field to take action upon.

    Developing an organized and repeatable workflow was crucial for minimizing costs, time spent (especially since the staff was small), and inadvertent human error, particularly with models being built at such a large scale.
  • After the campaign ended in 2012, we re-examined the strategies we employed and the problems they solved.

    We realized that if we generalized them, we could solve similar problems for clients in the political, non-profit, and corporate worlds.

    Which is exactly what Civis did.

    What you see here is a sample of clients, in addition to Enroll America, that we have helped better target their advertising dollars, identify potential customers for greener sources of electricity, and determine public awareness and sentiment on their brand or cause.

    In the past year, we took it a step further and we formed a partnership with Discovery Communications to inform more sophisticated audience targeting approaches, ratings forecasting, and marketing spend.

    We anticipate making more partnerships like this in the future.

    The examples I gave are all problems with a similar flavor to what Civis successfully solved in 2012 – identifying and reaching the people you care about most.
  • Our diverse client portfolio, innovative approaches, and proven track record have made Civis Analytics’ consulting services highly sought after in the predictive analytics space.

    However, we’re equally passionate about removing obstacles to doing data science.

    Our steady client pipeline enables us to formalize our approach in the form of a cloud-based data science application.

    Our software, Civis, or “the Platform”, supports the entire workflow of a typical data science project, from data warehousing to data processing to predictive modeling to reporting.

    This enables organizations to easily take control of their own data and unlock their insights.
  • This is how we turn our client work experiences into software.

    We select novel problems brought forth by our clients and work with them to deliver a solution.

    This is primarily addressed by our Applied Data Science department.

    At the same time, we conduct research and experiment with different methods to solve the problem, with one eye towards determining how to generalize the solution.

    This is primarily done by the Data Science R&D department.

    Finally, solutions are integrated into our software platform by the Software Engineering, or Tech, department.

    Users of our software platform, both clients and our Applied Data Scientists, provide us with valuable feedback, which is continuously incorporated.

    This unique, synergistic cycle enables us to deliver high quality results to our customers.
  • In our day-to-day work, all departments pitch in on both lines of business, ensuring fluency on all the company’s offerings and thus better decision making.

    We also collaborate across departments on all projects, big or small.

    Today I will be focusing on how my department, the DS R&D department, manages its workload and how it works with the Applied Data Science and Tech departments.
  • The R&D department is the only department that is intimately aligned with both lines of business.

    We’re split into 3 different teams.

    Modeling Methodology focuses on developing new modeling workflows.

    Unstructured Data specializes in data that can’t be neatly summarized in a flat file, like text data.

    Engineering is responsible for managing our production codebases of new features for our software product.

    Our department is “built around code”:

    “We're trying to build up knowledge and best practices, and being built around code lowers our communication costs, errors, redundancy, and facilitates us making software.”

    To roadmap what we build, based on what we’ve learned from recent client engagements, we have day-long semi-annual department off-sites.

    When developing new methodologies, we use an academic-influenced approach – empirical and thorough such that our recommended solution covers all the edge cases.

    When building out workflows, we follow guidelines common to most software development projects, including some ideas from the Agile methodology: we have daily standups to make sure everyone’s on the same page about the status of the codebases, and we do code reviews before any changes are shipped. Our standups are held on a per-repository basis, so they don’t waste anyone’s time.

    To do our work, these are our favorite tools. Let’s take a look at how we use them.
  • HipChat and GitHub form the backbone of our communications.

    For those not familiar with these tools, HipChat is an instant messaging tool for organizations.

    GitHub is a web interface, built on top of the version control system Git, for teams to collaborate on a codebase.

    These tools are crucial to our philosophy of being built around code.

    They enable members across the company to participate by asking questions and generally weighing in.

    Departmental members use them to discuss implementation.

    These tools are much faster than email because anyone who knows the answer can see a question and respond to it.
  • When developing new methods, we like to use Jupyter and Google Drive.

    We use Jupyter for its IPython Notebook capabilities.

    It allows us to run Python code, especially modules from our codebase, interactively, and to chain components together to make new workflows.

    Jupyter also has presentation functionality, so we also use it as a log book to record and present results in internal meetings.

    Sometimes we also use Google Drive to record and share results with members of other departments, such as Applied Data Scientists, who have a vested interest in the project but don’t require all the details.
  • Finally, to take the “pulse” of the R&D department as a whole, department heads use Google Drive and Asana for big picture planning.

    Asana is a project management tool which gives department heads a bird’s-eye view of what each team member is working on and how each project is progressing.

    Google Drive tools are used to collaborate on planning documents, whether those are plans for new functionality or revisions to organizational processes, such as rewriting our hiring exam.
  • That was how we, the R&D department, work together.

    How do we work with the Applied Data Scientists, the data scientists in our consulting arm?

    To make project staffing seamless, we designate a tag team to serve as the first point of contact for client engagements.

    This is the Modeling Methodology team.

    However, other R&D data scientists may be staffed on a project depending on expertise required.

    The R&D data scientist always serves as the Consulted in the RACI model.

    The RACI model is a popular project management model used in consulting.

    It emphasizes explicit roles for each team member to ensure accountability.
    R is for Responsible, a role held by the Applied Data Scientists.
    A is for Accountable; this is the Applied Data Science Manager or project manager.
    C is for Consulted, the role the R&D data scientist fills.
    And I is for Informed (the client).

    Lastly, we are open with Applied Data Scientists about R&D challenges in order to avoid schedule slips on the client engagement.

    The project plan is often tracked in Trello, a popular bulletin board app, with bulletin boards for each milestone’s requirements.
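
As a rough sketch of how the RACI roles just described could be written down and sanity-checked, assuming hypothetical names (`Raci`, `assignments`, `holders`, `check`) that do not come from the talk. The rule that only one party should be Accountable is a common RACI convention rather than something stated here.

```python
from enum import Enum


class Raci(Enum):
    RESPONSIBLE = "R"
    ACCOUNTABLE = "A"
    CONSULTED = "C"
    INFORMED = "I"


# Role assignments as described in the talk; the dictionary layout is illustrative.
assignments = {
    "Applied Data Scientists": Raci.RESPONSIBLE,
    "Applied Data Science Manager": Raci.ACCOUNTABLE,
    "R&D data scientist": Raci.CONSULTED,
    "Client": Raci.INFORMED,
}


def holders(assignments, role):
    """Return the sorted names of everyone assigned the given role."""
    return sorted(name for name, held in assignments.items() if held is role)


def check(assignments):
    """Return a list of problems; an empty list means every role is filled
    and no more than one party is Accountable."""
    problems = [
        f"no one is {role.name.title()}"
        for role in Raci
        if not holders(assignments, role)
    ]
    if len(holders(assignments, Raci.ACCOUNTABLE)) > 1:
        problems.append("more than one party is Accountable")
    return problems


problems = check(assignments)
```

Writing the roles down explicitly like this makes a gap, such as an engagement with nobody Consulted, visible before the project starts.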
  • Let’s revisit our client story – Mapping the Uninsured in America – to illustrate concretely how we work together.

    After Enroll America shared their problem with us, we assembled a project team of R&D data scientists and Applied Data Scientists to solve it.

    We worked with Enroll to refine the problem statement into a set of requirements, ultimately resulting in the design and execution of a phone survey on a sample of individuals, followed by building a model to capture the rest of the country.

    The project got under way.

    Throughout the project, the Applied Data Science Manager had weekly status calls with Enroll America and status meetings with the project team to make sure we were on schedule.

    Occasionally we staffed a couple extra data scientists to the project to make sure we delivered results on time when there was risk of a schedule slip.

    For example, we brought in an extra data scientist towards the end of the project to help produce graphs and visualizations of the results.

    Finally, we finished our analysis and presented our predictions to Enroll America.

    Afterwards, we did a postmortem and realized that automated model building would’ve made us more efficient.

    This is because we conducted our experiment in waves and built similar models as the results came in, with the only difference being the input data.

    Also, the analysts were each working on individual components of the analysis and writing their own R scripts. Those scripts overlapped heavily (for example, in the data processing steps), so a lot of time was wasted on duplicated work.
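
The postmortem finding can be sketched in code. This is a minimal illustration, not Civis's actual implementation: the function names `preprocess` and `build_model`, the field names, and the survey rows are all hypothetical, the "model" is just an observed rate per region, and Python stands in for the R scripts the analysts wrote. It shows one shared data processing step and one model-building routine rerun on each wave, where only the input data changes.

```python
from collections import defaultdict


def preprocess(rows):
    """Shared cleaning step: drop incomplete responses and normalize region labels."""
    cleaned = []
    for row in rows:
        if row.get("region") is None or row.get("uninsured") is None:
            continue
        cleaned.append(
            {
                "region": row["region"].strip().lower(),
                "uninsured": bool(row["uninsured"]),
            }
        )
    return cleaned


def build_model(rows):
    """Toy stand-in for a predictive model: the observed uninsured rate per region."""
    counts = defaultdict(lambda: [0, 0])  # region -> [uninsured count, total count]
    for row in preprocess(rows):
        counts[row["region"]][0] += row["uninsured"]
        counts[row["region"]][1] += 1
    return {region: hits / total for region, (hits, total) in counts.items()}


# Hypothetical survey waves; the same routine runs on each as results come in.
waves = {
    "wave_1": [
        {"region": " North", "uninsured": 1},
        {"region": "north", "uninsured": 0},
    ],
    "wave_2": [
        {"region": "South", "uninsured": 1},
        {"region": "south", "uninsured": None},  # incomplete response, dropped
    ],
}

models = {name: build_model(rows) for name, rows in waves.items()}
```

Because every wave passes through the same `preprocess` and `build_model` functions, the cleaning logic lives in one place instead of being rewritten in each analyst's script.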
  • So that’s how we work with the Applied Data Scientists on consulting projects.

    How do we work with the Tech department?

    Much like how we work with the Applied Data Science department, we’ve designated a team to interface with the Tech department, and they have designated one to interface with us.

    That would be the Engineering team on our side and the Modeling team on their side.

    The Engineering team in Data Science R&D consists of data scientists who speak software development, and the Modeling team in the Tech department consists of software engineers who speak data science.

    Most of our communications are done using module or project-specific chatrooms and GitHub issue tickets, which get answers quickly.

    To promote really inspired product development, we identify opportunities to form cross-functional teams, such as using the Platform’s API to develop new apps and teaching each other best practices for software development via brownbag sessions.

  • Let’s revisit the Enroll America project for an example of how the R&D data scientists work with the software engineers.

    After the postmortem for the Enroll America engagement, we began prototyping automated modeling functionality, communicating to the Tech department the motivation for it and including them in discussions about implementation and feasibility.

    Once we finished the prototype and ensured that it passed all the tests and code review, the Engineering team in R&D worked with the Modeling team in Tech to integrate it as a new feature in the Platform.

    We used GitHub and HipChat to discuss questions that came up, such as resource usage, input/output specifications, and data visualizations we wanted to provide to the end user.

    Together, the R&D department and the Tech department successfully built and integrated the Build Model module that exists today in the Platform.
  • In summary, a lot of our approaches have a common theme, which is minimizing communication costs within the R&D department and with other departments.

    This is evidenced by our embrace of some free or open-source tools for collaboration and our general belief in transparency about challenges.

    We also emphasize collaborative opportunities between departments to strengthen our cohesiveness as a team, be it working on client engagements together or learning best practices in a seminar format.

    A lot of our ideas come from the valuable perspectives of our employees, who come from a wide array of backgrounds.

    Thus, our project management strategies are a hybrid of techniques seen in more established industries such as software engineering, consulting, and academia.

    I hope the tips presented in my talk today will make doing data science more manageable for your team.

    Thank you for your time.