Enabling Your Data Science Team with Modern Data Engineering

Presented at MinneBOS 2019 by James Densmore

  1. Enabling Your Data Science Team with Modern Data Engineering
     James Densmore, Data Liftoff, @jamesdensmore
  2. About Me
     Founder & Consultant at Data Liftoff
     Experience leading Data Science and Data Engineering teams
     Technical background (Software Engineering and Data Engineering)
  3. What is “Modern” Data Engineering?
     ● Thanks to highly scalable, columnar databases (usually cloud based), we’re now able to store, structure, and query extremely high volumes of data at a low cost. Really!
     ● A mix of data lakes and data warehouses
     ● ELT instead of ETL
     ● Closer to software engineering than in the past
     ● No longer a “back office” function. Often aligned with product development. Sometimes a stand-alone Tech team
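
To make the "ELT instead of ETL" point concrete, here is a minimal sketch, not taken from the talk. It uses Python's built-in `sqlite3` as a stand-in for a cloud warehouse (SQLite is neither columnar nor cloud based), and the table names, column names, and sample rows are all hypothetical. The idea it illustrates: load the raw data first, then transform it with SQL inside the database, so the raw table stays available alongside the modeled one.

```python
import sqlite3

# Hypothetical raw events, as they might arrive from a source system.
RAW_EVENTS = [
    ("u1", "click", "2019-05-01"),
    ("u1", "view", "2019-05-01"),
    ("u2", "click", "2019-05-02"),
]


def run_elt(conn, rows):
    """Load raw rows untouched, then transform them with SQL in the database."""
    # Extract + Load: land the data as-is so the raw table remains queryable.
    conn.execute(
        "CREATE TABLE raw_events (user_id TEXT, event_type TEXT, event_date TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)
    # Transform: derive a modeled table from the raw one, after loading.
    conn.execute(
        """
        CREATE TABLE daily_clicks AS
        SELECT event_date, COUNT(*) AS clicks
        FROM raw_events
        WHERE event_type = 'click'
        GROUP BY event_date
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
run_elt(conn, RAW_EVENTS)
daily_clicks = conn.execute(
    "SELECT event_date, clicks FROM daily_clicks ORDER BY event_date"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
```

In an ETL flow the aggregation would happen before loading and only `daily_clicks` would reach the database. Here both tables exist, which is what lets data scientists go back to the unprocessed rows.
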
  4. Difference Between Data Science and Data Engineering - Oversimplified!
     Data Engineers build and maintain data infrastructure, including data warehouses.
     Data Scientists use data to make predictions, run analyses, and build models to power products.
  5. Common Data Engineering Tools and Platforms
  6. Common Data Science Tools and Platforms
  7. Don’t Assume The Two Teams Understand Each Other
  8. What Data Scientists Should Know about Data Engineers
     ● They’re software engineers at heart
     ● They don’t always know how data is generated. Some questions are better left to the production engineers
     ● They’re interested in your model, but probably not the math 😆
     ● They’re thinking about scale and efficiency - sometimes too much so
     ● You are one of many customers to them
  9. What Data Engineers Should Know about Data Scientists
     ● They write code, but they’re usually not software engineers
     ● They will look into data in more detail than anyone else, including you
     ● Their work is difficult to put into tickets and sprints
     ● Scale and performance are not their top priority
     ● They understand the “why” of what they’re building - just ask
  10. What Data Science Needs from a Data Infrastructure
     ● Access to both transformed and unprocessed data
     ● Definitions of columns/attributes and how data is generated
     ● A safe space to experiment and tune models
       ○ Plenty of storage
       ○ No impact on production or other users
       ○ Read permissions on existing datasets, write/create space for themselves
     ● A path to production
  11. How This Differs From Other Consumers of Data
     ● Data warehouses traditionally serve fully transformed and aggregated data to BI tools, dashboards and data analysts. Data Scientists need raw data - a lot of it
     ● The data warehouse was once the “end of the road” for data. Data Scientists need it in other forms and locations.
     ● Data products built by the data science team may end up in production. What’s the path to get there?
  12. Asking More from Data Engineering
     ● New pipelines to support data science
     ● Documenting more detail of the raw data and fielding highly specific questions about it
     ● Strain on databases from ad hoc queries
     ● Managing data security and privacy outside of the warehouse
     ● Model deployment to production
  13. Infrastructure Considerations
     ● Data Lakes + Databases
     ● Secure storage for flat files
     ● VMs for building and testing models in development
       ○ Discourage local development with sensitive data
     ● Share best practices for accessing data from scripts - credential management
     ● Data governance now extends to development machines, VMs, and flat file storage
     Image Credit: Amazon Web Services
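
One common practice behind the "credential management" bullet is to keep secrets out of scripts and read them from the environment at run time. The sketch below is an illustration under that assumption, not something shown in the slides. The function name `load_db_credentials` and the `WAREHOUSE_*` variable names are hypothetical.

```python
import os

# Hypothetical variable names; a real team would agree on its own convention.
REQUIRED_VARS = ("WAREHOUSE_HOST", "WAREHOUSE_USER", "WAREHOUSE_PASSWORD")


def load_db_credentials(env=None):
    """Read connection settings from environment variables, not from the script."""
    env = os.environ if env is None else env
    # Fail early with a clear message instead of connecting with blank values.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["WAREHOUSE_HOST"],
        "user": env["WAREHOUSE_USER"],
        "password": env["WAREHOUSE_PASSWORD"],
    }
```

A script would call `load_db_credentials()` and pass the result to whatever database client the team uses. Nothing secret is committed with the analysis code, which matters once governance extends to development machines and VMs.
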
  14. An Example - Building a Recommender System
     ● Data to build the model
       ○ Previous recommendations and clicks, search logs, content metadata, user profiles, user activity history
       ○ What they want might not exist!
     ● Infrastructure to build the model
       ○ Storage for exports of data
       ○ VMs to build and run models - these need secure access to input data and a place to output results for analysis
     ● Moving model to production
       ○ Data engineering + application engineers
     ● Instrumenting further tracking and data collection in production
       ○ Build new pipelines and select storage
     ● Deploy, analyze, iterate and deploy again!
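
To show why click history is on the data wish list, here is a deliberately tiny sketch of one simple technique: item co-occurrence counting ("users who clicked X also clicked Y"). It is an assumption chosen for illustration, not the model described in the talk. The `CLICKS` sample and both function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical click log rows: (user_id, item_id).
CLICKS = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "b"), ("u3", "c"),
]


def build_cooccurrence(clicks):
    """Count how many users clicked each pair of items."""
    items_by_user = defaultdict(set)
    for user, item in clicks:
        items_by_user[user].add(item)
    counts = defaultdict(Counter)
    for items in items_by_user.values():
        # Each unordered pair is counted once per user, in both directions.
        for left, right in combinations(sorted(items), 2):
            counts[left][right] += 1
            counts[right][left] += 1
    return counts


def recommend(counts, item, k=2):
    """Return up to k items most often co-clicked with `item`."""
    neighbours = counts.get(item, Counter())
    # Sort by descending count, then by item id so ties are deterministic.
    ranked = sorted(neighbours.items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:k]]


counts = build_cooccurrence(CLICKS)
```

Even this toy version needs per-user click history as input, and it only improves once recommendation impressions and clicks are tracked in production, which is where the new pipelines come in.
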
  15. Partners, Not Siloed Services
     ● The closer together, the better!
     ● Over-communicate
       ○ Overlapping Slack channels
       ○ Sit in on planning meetings
     ● Share knowledge
       ○ Monthly demos or lunch-and-learns
       ○ Share detailed release notes
     ● Recognize differences in sizing, planning and executing projects
     Image Credit: Vector Open Stock
  16. Overcome Org Structure
     ● A single leader overseeing both teams, even if not directly, is ideal
       ○ Not always possible! Team up leaders and keep them close
     ● Align around projects, not org charts
     ● Find team members most curious about the “other side” and give them opportunities to dip their toes in
     ● Share, and speak to, successes as a unified team. Perception is reality
  17. Other Common Pitfalls
     ● Hiring data scientists without having data engineers
     ● Assuming that because you collect “data”, data scientists have what they need
     ● Structuring data science work like you do software and data engineering
     ● Underestimating the failure rate of data science projects in comparison to data engineering
  18. Final Tips & Ideas
     ● New tools won’t save you, but don’t ignore them
     ● Be flexible in your hiring. Generalists bridge gaps
     ● Invest in lightweight documentation, and commit to keeping it current
       ○ Accurate over Glossy
     ● Cross-team interviewing and onboarding
     ● Question your team structure often
     ● When in doubt, talk!
  19. Thank You! @jamesdensmore