This presentation is an introduction to the structures and steps for building a Data Science Team inside an Enterprise.
- Data Science Team
- Standardized project structure
- Execution of data science projects
- Azure Machine Learning Workbench
2. A few things about me
I am a GIS Solution Architect
Microsoft Professional Program Data Science Certificate
Working with Azure since 2010
Open Source developer and contributor
• React-Leaflet-Google (npm downloads > 26,500)
• GeoTrellis (geographic data processing engine for Spark)
• Magellan (Geo Spatial Data Analytics on Spark)
3. Agenda
• Introduction
• Data Science Team
• Standardized project structure
• Execution of data science projects
• Azure Machine Learning Workbench
4. Introduction
What is Data Science?
Data science is an interdisciplinary field of scientific methods, processes,
algorithms and systems to extract knowledge or insights from data in various
forms, either structured or unstructured, similar to data mining.
Source: Wikipedia
6. Introduction
What are the characteristics of a project at Enterprise stage?
1. I am not alone, I am part of a Team.
2. The deliverables should be reusable and production ready.
3. The solution needs to scale up.
7. Introduction
How can we take Data Science to Enterprise level?
Follow the 3 principles:
1. The Team writes experiments
2. The Team members keep their work as simple as possible
3. The Team members collaborate and share experiments, ALL THE TIME!!!!
8. Data Science Team
Data science functions in enterprises are organized into:
1. Data science group/s
2. Data science team/s within group/s
9. Data Science Team
Roles in Data Science Group:
• Project Individual Contributor. Data Scientist, Business Analyst, Data
Engineer, Architect, etc. A project individual contributor executes a data
science project.
• Project Lead. A project lead manages the daily activities of individual data
scientists on a specific data science project.
• Team Lead. A team lead manages a team in the data science unit of an
enterprise.
• Group Manager. A group manager manages the entire data
science unit in an enterprise.
10. Data Science Team
Tasks in Data Science Group:
Roles: Group Manager, Team Lead, Project Lead, Data Scientist
1. Create Group Account on a Version Control Platform
2. Create Team Environment
3. Create Project
4. Add Storage/Analytics Resources to Project
• Merge Pull Request
5. Execute Project
13. Standardized project structure
Azure-TDSP-ProjectTemplate
Project Charter
• Business background
• Scope
• Personnel
• Metrics
• Plan
• Architecture
• Communication
Exit Report
• Overview
• Business Domain
• Business Problem
• Data Processing
• Modeling, Validation
• Benefits
• Learnings
14. Standardized project structure
We need standards
ONNX (http://onnx.ai/) is an open format to represent deep learning
models. With ONNX, AI developers can more easily move models
between state-of-the-art tools and choose the combination that is best
for them.
ONNX is developed and supported by a community of partners,
including Facebook and Microsoft.
15. Execution of data science projects
What is an experiment?
An experiment is a Study.
16. Execution of data science projects
Macroscopically
Introduction → Main Part → Conclusion
17. Azure Machine Learning Workbench
What is that?
It’s an integrated end-to-end Data Science Solution.
Requirements
• Create an Azure Machine Learning services account
(https://bit.ly/2x1yWu0)
Typically, a data science project is done by a data science team, which may be composed of project leads (for project management and governance tasks) and data scientists or engineers (individual contributors / technical personnel) who will execute the data science and data engineering parts of the project.
Definition of four TDSP roles
With the above assumption, we have specified four distinct roles for our team personnel:
Project Individual Contributor. Data Scientist, Business Analyst, Data Engineer, Architect, etc. A project individual contributor executes a data science project.
Project Lead. A project lead manages the daily activities of individual data scientists on a specific data science project.
Team Lead. A team lead manages a team in the data science unit of an enterprise. A team consists of multiple data scientists. For a data science unit with only a small number of data scientists, the Group Manager and the Team Lead might be the same person.
Group Manager. A group manager manages the entire data science unit in an enterprise. A data science unit might have multiple teams, each of which is working on multiple data science projects in distinct business verticals. A Group Manager might delegate their tasks to a surrogate, but the tasks associated with the role do not change.
Note: Depending on the structure of an enterprise, a single person may play more than one role, or more than one person may work in a single role. This is frequently the case in small enterprises or in enterprises with a small number of personnel in their data science organization.
This is a general project directory structure for the Team Data Science Process developed by Microsoft. It also contains templates for various documents that are recommended as part of executing a data science project when using TDSP.
The Team Data Science Process (TDSP) is an agile, iterative data science methodology to improve collaboration and team learning. It is supported through a lifecycle definition, standard project structure, artifact templates, and tools for productive data science.
NOTE: In this directory structure, the Sample_Data folder is NOT supposed to contain LARGE raw or processed data. It is only supposed to contain small and sample data sets, which could be used to test the code.
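One way a team could enforce the note above is a small check that flags oversized files in Sample_Data before they are committed. The sketch below is only an illustration: the 5 MB threshold and the function name find_oversized_files are assumptions, since the note says "small" without giving a number.

```python
from pathlib import Path

# Assumed threshold: the note only says "small", so 5 MB is a placeholder.
MAX_SAMPLE_BYTES = 5 * 1024 * 1024


def find_oversized_files(sample_dir, max_bytes=MAX_SAMPLE_BYTES):
    """Return (relative path, size in bytes) for files larger than max_bytes."""
    root = Path(sample_dir)
    oversized = []
    # Walk the folder recursively in a stable order.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > max_bytes:
                oversized.append((path.relative_to(root).as_posix(), size))
    return oversized
```

A project lead could run this as part of reviewing a pull request and ask for any reported file to be moved to the project's storage resources instead.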
Code folder for hosting code for a Data Science Project
This folder hosts all code for a data science project. It has three sub-folders, belonging to 3 stages of the Data Science Lifecycle:
Data_Acquisition_and_Understanding
Modeling
Deployment
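The three sub-folders listed above can be created with a few lines of Python. This is a minimal sketch, not part of the TDSP template itself: the top-level folder name Code and the function name create_code_folders are assumptions made for illustration.

```python
from pathlib import Path

# Sub-folder names as listed above, one per lifecycle stage.
CODE_SUBFOLDERS = (
    "Data_Acquisition_and_Understanding",
    "Modeling",
    "Deployment",
)


def create_code_folders(project_root):
    """Create Code/<stage> folders under project_root and return their paths."""
    code_dir = Path(project_root) / "Code"  # assumed top-level folder name
    created = []
    for name in CODE_SUBFOLDERS:
        folder = code_dir / name
        # exist_ok keeps the call safe to repeat on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Because existing folders are left alone, the function can be run again on a project that is already in progress.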
Folder for hosting all documents for a Data Science Project
Documents will contain information about the following:
System architecture
Data dictionaries
Reports related to data understanding, modeling
Project management and planning docs
Information obtained from a business owner or client about the project
Docs and presentations prepared to share information about the project
The two documents under Docs/Project, namely the Charter and the Exit Report, are particularly important to consider. They help to define the project at the start of an engagement and provide a final report to the customer or client.
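As an illustration, the Charter and Exit Report could be stubbed out at project creation so that every project starts with the same sections. The sketch below uses the section names from the Standardized project structure slide. The file names Charter.md and Exit_Report.md, the Markdown heading layout, and the function names are assumptions, not the official template contents.

```python
from pathlib import Path

# Section names taken from the Project Charter and Exit Report slide.
CHARTER_SECTIONS = [
    "Business background", "Scope", "Personnel", "Metrics",
    "Plan", "Architecture", "Communication",
]
EXIT_REPORT_SECTIONS = [
    "Overview", "Business Domain", "Business Problem", "Data Processing",
    "Modeling, Validation", "Benefits", "Learnings",
]


def render_stub(title, sections):
    """Build a document skeleton with one empty section per heading."""
    lines = [f"# {title}", ""]
    for section in sections:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


def write_project_docs(project_root):
    """Write the two stub documents under Docs/Project and return their paths."""
    docs_dir = Path(project_root) / "Docs" / "Project"
    docs_dir.mkdir(parents=True, exist_ok=True)
    # File names are hypothetical; adjust them to the team's template.
    targets = {
        "Charter.md": ("Project Charter", CHARTER_SECTIONS),
        "Exit_Report.md": ("Exit Report", EXIT_REPORT_SECTIONS),
    }
    written = []
    for filename, (title, sections) in targets.items():
        path = docs_dir / filename
        if not path.exists():  # never overwrite a document already started
            path.write_text(render_stub(title, sections), encoding="utf-8")
        written.append(path)
    return written
```

The charter is then filled in at the start of the engagement and the exit report at the end, with the section list acting as a checklist for both.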