Greg Werner, CEO & Founder, 3Blades.io at MLconf ATL 2017
Productive Machine Learning and Deep Learning Projects
Machine Learning (ML) and Deep Learning (DL), known collectively as Artificial Intelligence, are no longer luxuries but necessities if companies want to remain relevant in today’s market. Data-driven organizations that encourage ML and DL projects can create and deploy models that generate predictions in real time. Even more exciting, these real-time predictions allow organizations to trigger actions, which ultimately improves the bottom line. However, organizations struggle to incorporate ML and DL projects that produce models which improve performance. This talk focuses on how companies can enable data science platforms so that data engineers, data scientists, and business analysts can quickly explore data, create and test ML and DL models, and deploy to staging and production environments regardless of the language or framework used by the team and organization.
Talking points:
- Siloed data initiatives are a common denominator: data scientists were segregated from the rest of the organization, and tooling was disparate.
- Initially, the need to streamline Jupyter Notebook deployments for a class of students came up after many students complained about the time and effort involved in installing specific dependencies to complete their tasks.
- Package managers were not enough: users also needed an integrated, consistent, and reliable environment for completing their assignments in Jupyter Notebooks.
- We also noticed that companies, in general, did not provide a homogeneous environment for their data science teams. This led to many headaches but was considered business as usual.
Talking points:
- Issues encountered in the education vertical were common across industry, i.e. too much time spent on configuration.
- Basic ROI calculations justified the implementation and support of a data science hub.
- The data science platform “a-ha” moment came when pitching a solution to consolidate project workspace environments for different people across different organizations, in particular for Exploratory Data Analysis (EDA).
- Educational institutions are usually constrained by budget requirements; however, after providing ROI numbers on how much time and effort Teaching Assistants (TAs) spent providing technical support for their users, the decision to implement a data science platform was a no-brainer.
- Nevertheless, we suspected that enterprises (SMBs and large enterprises alike) were encountering the same challenges, exacerbated by the larger number of personas involved in data science and analytics teams.
- Disparate teams: data scientists siloed from the rest of the organization
- The ultimate goal is to automate certain processes within the organization
- Automation helps improve the top and bottom lines and improves competitiveness
Organizations struggle to become ‘data driven’. What does that mean? Data driven organizations are those that wish to use the data they have available to improve insights and allow their business to become more competitive. Assuming the organization has successfully consolidated their data into central data warehouses or data lakes, and assuming this data is defined with standard schemas, data science and data analytics teams have the power to analyze the data, obtain valuable insights and start improving the agility of their organizations with ‘prescriptive analytics’ and ‘predictive analytics’. Prescriptive analytics involves creating machine learning and deep learning models that automate certain business processes, such as:
- Automatically tagging images with classification types (cat or not a cat)
- Automatically classifying a customer with the probability that they will churn
- Recommending value-added products to improve the checkout dollar amount at an e-commerce site
- Classifying email as spam or not spam
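The churn example above can be sketched in a few lines with scikit-learn. This is a hypothetical sketch on synthetic data, not a production model: the features and the toy churn rule are assumptions made for illustration.

```python
# Minimal churn-classification sketch on synthetic data (all names hypothetical).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two made-up features per customer, e.g. monthly spend and support tickets.
X = rng.normal(size=(200, 2))
# Toy labeling rule: more tickets than spend -> churned (assumption for the demo).
y = (X[:, 1] - X[:, 0] > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]  # churn probability per customer
print(round(model.score(X, y), 2))
```

In practice the predicted probability, not the hard label, is what lets an organization trigger an action (e.g. a retention offer) above a chosen threshold.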
However, organizations have struggled to integrate data scientists into their organizations. Data science teams that just ‘do the math’ and create visualizations on an organization’s data sets do not provide much value in and of themselves. Creating a machine learning model that automatically recommends a product that is not strategic to the organization likewise provides little value.
Dashboards democratize data so that team members can quickly absorb meaningful insights and key performance indicators. Exploratory data analysis (EDA) and model creation/deployment are not really part of the picture.
Traditional Business Intelligence tools have been around for years. Some tools offer specific integrations into a variety of data sources and allow users to quickly create rich, interactive visualizations of their data. SQL, a language made popular by relational databases, remains the workhorse for analytics. New developments help accelerate the time from data source to dashboard with in-memory calculations and GPU-powered databases, among others.
Big data tools, such as Hadoop and Spark, allow users to create dashboards from large data sets. However, BI tools traditionally rely on structured data. Also, traditional dashboards don’t account for how to create machine learning and deep learning models.
Just a review of a Data Scientist’s skill set.
Talking points:
- Organizations realize they need to automate their processes, and that automation must come from real-time analysis of data points.
- The deliverable is not just a BI dashboard anymore; the deliverable is a deployable machine learning or deep learning model.
- Embedding a data science team member into the group increases value.

As mentioned, historically data science teams have been isolated from the rest of the organization. Successful data-driven organizations embed their data scientists into various business groups. For example: data extraction and loading into a warehouse table are done by engineering teams; however, a data science liaison embedded within a certain department or relevant company-wide project can help data engineers improve the schema definition for the data being exposed, which could save valuable time during the exploratory data analysis phase. Data engineers can create tables using their favorite Extract, Transform and Load (ETL) tools to remove not-a-number (NaN) rows, drop irrelevant columns such as database primary/foreign keys (PK/FKs), etc. Inversely, the data scientist could help the person telling the data story (could be anyone in the group, including herself) understand which features are relevant, how certain normalizations were performed without delving into the technical details, etc. “This was the only customer that bought a widget in Atlanta, so the attributes for this person were adjusted to not skew the dataset in their favor.”
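The ETL cleanup described above (dropping NaN rows and database PK/FK columns) can be sketched with pandas. Column names here are hypothetical examples, not from any real schema.

```python
# Sketch of the ETL cleanup step: drop key columns and NaN rows (names assumed).
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "customer_pk": [1, 2, 3, 4],      # database primary key, irrelevant for modeling
    "region_fk": [10, 10, 20, 20],    # foreign key, also dropped
    "widget_spend": [120.0, np.nan, 75.5, 200.0],
    "visits": [3, 5, np.nan, 8],
})

clean = (
    raw.drop(columns=["customer_pk", "region_fk"])  # remove PK/FK columns
       .dropna()                                    # remove rows containing NaN
       .reset_index(drop=True)
)
print(clean.shape)  # → (2, 2)
```

The same transformations are typically expressed in whatever ETL tool the engineering team already uses; pandas just makes the intent easy to read.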
- Move from predictive to prescriptive analytics
- Deliver a machine learning or deep learning model that allows organizations to automate processes
- Visualizations are still important, but are used for telling a data story during EDA and for visualizing how models behave in real time
Predictive analytics looks at historical trends in data to provide insights. Organization members are then tasked with optimizing processes to improve organizational results based on those trends. However, companies need to automate tasks (removing the human from the actual task execution) based on certain indicators. In this case, visualizations are used in EDA to better understand the data, with the goal of creating and deploying machine learning and deep learning models that can automate certain organizational processes.
- Support data source imports from multiple sources
- EDA is needed as a first step to build and deliver artifacts that automate business processes; artifacts in this context are machine learning and deep learning models
- Data engineers and DevOps need access to the data science hub to streamline their own processes
Traditional teams use Excel spreadsheets, among other tools, and shuttle files back and forth via email, chat applications, or external project management solutions. Even if all users work within shared environments such as Google Docs or Office 365, teams have no way of sharing all files and tools within one common environment, particularly for exploratory data analysis, since viewing and editing files within these environments are constrained to a certain set of file formats. Nevertheless, certain organizations and individuals prefer one language over another. For example, a data science team working with the Finance department may lean toward the R programming language, while the data science team working with the Marketing department may lean toward the Python programming environment. In both cases, users may use multiple tools for one language. For example, some individuals may prefer RStudio for R, and others may prefer using R with Jupyter Notebooks. Server management is important to optimize compute resources.
A central source for project files alleviates compliance requirements. Usually, data engineers (either due to security requirements or simply because they don’t want to surface multiple schemas/formats for different users) would rather deliver the data product to a ‘clean’ table, so data scientists can do their work using self-service approaches. Having a centrally managed set of files for specific projects also helps keep things organized when different users are accessing project files, so version control becomes important as well.
Data Science with Teams
Improve the efficiency of your data science teams with platforms that enhance collaboration
● Some Background
● Data Science Project Teams
● Some Solutions
Integration experience with Oil & Gas, Financial, Insurance and Retail industries
What did these customers have in common? All had data science teams that worked in silos
Difficulties when taking a data science project to production
What’s going on here?
We started to do some digging!
Data Teams - The Old Way
Department Teams: Data Scientist, Data Analyst, IT Manager
The Analytics Deliverable
A dashboard! An interactive dashboard is even cooler.
Data Science Teams - The New Way
Data Scientist, Finance Manager
Tax and Compliance
- Data Engineers
- Business Intelligence
The Data Science Deliverable
A machine or deep learning model!
I Want GPUs - And I Just Want Them to Work
Workaround for the NVIDIA Docker wrapper:
- nvidia-docker run -d -p 8888:8888 tensorflow/tensorflow:latest-gpu
- docker run -ti --rm `curl -s http://localhost:3476/docker/cli` nvidia/cuda nvidia-smi
- docker run -ti --rm --volume-driver=nvidia-docker --volume=nvidia_driver_375.82:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 nvidia/cuda nvidia-smi
The Need for DevOps Chops
Reverse Proxy with consul-template
The old way... The new way...
Reverse Proxy with static upstream location
Uh-oh, someone has to manage this stuff!
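The “new way” above can be sketched as a consul-template file that rewrites the nginx upstream block whenever notebook containers register or deregister in Consul, so nobody hand-edits static upstream entries. The service name, file name, and port are assumptions for illustration, not 3Blades specifics.

```
# notebooks.ctmpl — hypothetical consul-template sketch; consul-template
# re-renders this into nginx.conf and reloads nginx when services change.
upstream notebook_servers {
{{ range service "jupyter-notebook" }}
  server {{ .Address }}:{{ .Port }};
{{ end }}
}

server {
  listen 80;
  location / {
    proxy_pass http://notebook_servers;
  }
}
```

Run with something like `consul-template -template "notebooks.ctmpl:/etc/nginx/conf.d/notebooks.conf:nginx -s reload"` so the proxy stays current without manual intervention.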
Our Architecture - API First and Microservices
Provide flexibility with the tools that data scientists use for exploratory
data analysis and visualizations
One central source for project files with support for version control
Share visualizations from EDA
Train and save Machine Learning and Deep Learning models with
multiple frameworks, from within the same project
Streamline deployment pipelines
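The last two bullets (train and save models, then streamline deployment) can be sketched with scikit-learn and joblib: train a model, persist it as an artifact, and verify that the restored artifact behaves identically before handing it to staging or production. The file name and dataset here are assumptions for illustration.

```python
# Sketch: train, persist, and reload a model artifact (names are hypothetical).
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")   # the artifact a pipeline would promote
restored = joblib.load("model.joblib")

# Sanity check before deployment: restored model must match the trained one.
assert (restored.predict(X) == model.predict(X)).all()
```

A deployment pipeline can then treat `model.joblib` like any other build artifact: version it, promote it through staging, and serve it behind an API, regardless of which framework trained it.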