DECEMBER 15
GLOBAL AI BOOTCAMP IS POWERED BY:
The Data Science Process in ML
How to Apply It and When Do We Need It?
Thanks to our Sponsors:
Global Sponsor:
Venue Sponsor:
About me
• Software Architect @
o 16+ years professional experience
• Microsoft Azure MVP
• External Expert Horizon 2020
• External Expert Eurostars-Eureka, InnoFund Denmark
• Business Interests
o Web Development, SOA, Integration
o IoT, Machine Learning, Computer Intelligence
o Security & Performance Optimization
• Contact
ivelin.andreev@icb.bg
www.linkedin.com/in/ivelin
www.slideshare.net/ivoandreev
AGENDA
Major Tools
The Purpose of ML
AI as a Service
Iterative ML Process
Takeaways
Demo
Machine Learning and Microsoft
• Azure ML integrated, end-to-end data science and advanced analytics
• Microsoft ML related services/tools
• Highlights
o Built on open source technologies (Jupyter Notebook, Spark, Python, Docker)
o Execute experiments in isolated environments and GPU-enabled VMs
DEPRECATED (consolidated into “Machine Learning Service”, preview):
o Azure ML Workbench
o Azure ML Experimentation Service
o Azure ML Model Management Service
MAINTAINED AND IMPROVED:
o Azure ML Studio
o Data Science VM
o Azure Databricks
o Cognitive Toolkit (CNTK)
o Azure Batch AI Training
o Visual Studio Code Tools for AI
o Microsoft Cognitive Services, LUIS.ai
o Libraries for Apache Spark (MMLSpark)
o ML Services for SQL Server (R, Python)
Azure ML Workbench
Desktop application (Windows, macOS) with
• Built-in Jupyter Notebook services and Git integration
• End-to-end process support
o Model development and experimentation (Python)
o Powerful inspectors for data analysis
o Data transformations by example
o Model history and deployment
• Easy to use, but resource hungry
* Replaced in the Sept 24, 2018 release to make way for an improved architecture
(ref. to Azure ML SDK for Python or Azure Databricks for big datasets)
Azure ML Studio
• Visual workspace to build, test and deploy ML solutions
• Highlights
o Cross-browser drag and drop, no programming
o Rich set of modules
o Fits beginners and advanced users
o Unlimited extensibility (R Script, Python Script)
o Enterprise grade cloud service (SLA 99.95%)
o ML REST web services consumption
o Jupyter Notebook
o Azure AI Gallery (9000+ samples)
• At what price?
o Free plan available (10GB storage, 2 web services, 1000 requests/month)
o $10 seat/month + $1 experiment/hour
Azure Data Science VM
• Pre-configured cloud environment for AI & Data Science
• Highlights
o Fully operational environment
o 50+ tools DEV, ML, BigData, Data management
o Windows and Linux (Ubuntu/CentOS)
o Updated every few months
o On-demand elastic capacity
o GPU optimized VMs for deep learning
o Up to 4x NVIDIA K80 or V100 GPUs
o Up to 128 vCPU, up to 6,144 GiB RAM
• At what price?
o From $11.76/month to $14,314/month
Azure ML Service (preview)
• Cloud-based environment to develop, train, test, deploy, manage, and track ML models
• Highlights
o Model management
o Distributed deep learning
o Version control and reproducibility
o Hybrid deployment (Local, Cloud, Edge)
o Automated ML (data prep, algorithm, parameters)
o Latest open source technologies (TensorFlow, PyTorch, Jupyter, Docker)
o Scale up or out with large GPU-enabled clusters in the cloud
• At what price?
o From $23.51/month to $29,143.94/month
The purpose of ML modelling is:
• Generate predictions
• Understand true relations
Machine Learning Challenges
• Asking the right questions
• Typically 1 Model = 1 Question
• Requires training data
o Real-world data is messy (wrong or missing data)
o Feature engineering transforms to predictive features
o Feature extraction ( i.e. IP Address -> population density)
o Feature selection for informative features
• Overfitting model
o “Kicks ass” while training,
o but fails badly on real predictions
• Model validation
o “Sense” how well the model works on new data (see the cross-validation sketch below)
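A minimal validation sketch (scikit-learn is assumed here as the tooling; any ML library with hold-out and cross-validation support works the same way):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold-out split: train on one part, score on data the model never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: repeat the split several times for a more stable estimate
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())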
Users’ expectations:
• Engaging experience
• Effortless interaction
• High performance
• Relevant content
Businesses aim to:
• Provide high value
• Deliver faster and at lower cost
o Data science talent
o Powerful infrastructure
o Continuous improvement
The developer’s role is to bridge the gap.
Artificial Intelligence as a Service (AIaaS)
Def: Artificial intelligence off the shelf
• Bots and NLP – commands and guidance
• Cognitive APIs – speech, vision, translation, knowledge
• ML frameworks – build own model w/o infrastructure (i.e. Azure ML Service)
• Fully managed ML – templates, deployment, drag-drop (i.e. Azure ML Studio)
Pros:
• Innovation w/o upfront costs and expertise
• Usability – easy learning curve
• Scalability – start PoC, grow big
• Flexi cost – know what you pay for
Cons:
• Sharing data with vendors
• Data regulations (i.e. GDPR)
• Reduced transparency
• Breaking changes
The AIaaS market is expected to grow from $1.5B (2018) to $10.9B (2023)
(ResearchAndMarkets, Apr 2018)
A year ago this was not as achievable as it is now.
Some Key Azure AIaaS
Computer Vision
• Advanced algorithms that extract information from images
Face API
• Detect and analyze facial attributes
Custom Vision API
• Build, deploy, improve custom image classifiers (on tags)
LUIS.ai
• Apply custom ML intelligence to conversational natural language
Custom Decision (experimental)
• Learn behavioural patterns of users
The Data Scientist Job
• Appealing
o 64% believe they are working in this century’s “sexiest” job
• In demand
o 90% are contacted at least once a month with a job offer
o 50% weekly, 30% several times a week; 35% have <2y experience
• The dark side…
o All models are wrong, some are useful
o 80% of the time is data preparation
o Real-life, not academic problems
o Non-linear hypothesis testing
o No full automation
• No one cares how you do it
Automated ML (AML)
AML is a recommender system for ML pipelines, aiming to reach accuracy in less time
• Problem: Complexity scales faster than the time available
• Highlights
o Designed to not look at customer data
o Only each pipeline result is sent to the automated ML service
o Data pre-processing, algorithm experimentation, hyperparameter tuning
• How it Works
o Select the task type: classification (11 algorithms), regression (9), forecasting (9)
o Specify labeled data source and format (Numpy array, Pandas dataframe)
o Configure target for training (local, remote VM, AML Compute)
o Set AML configuration
from azureml.train.automl import AutoMLConfig

automl_classifier = AutoMLConfig(
    task='classification',           # classification / regression / forecasting
    primary_metric='AUC_weighted',   # metric the sweep optimizes
    max_time_sec=12000,              # time budget for the experiment
    iterations=50,                   # number of pipelines to try
    X=F_Train,                       # feature matrix (Numpy array / Pandas dataframe)
    y=F_Label,                       # label vector
    n_cross_validations=2)
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train
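A minimal sketch of how such a configuration could be submitted (assumptions: an existing Azure ML Workspace with a local config.json, and an experiment name chosen here only for illustration):

from azureml.core import Workspace
from azureml.core.experiment import Experiment

ws = Workspace.from_config()                    # loads workspace details from config.json
experiment = Experiment(ws, "automl-demo")      # hypothetical experiment name
run = experiment.submit(automl_classifier, show_output=True)   # runs the AutoML sweep
best_run, fitted_model = run.get_output()       # best pipeline found by the recommender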
Iterative ML Process
Data Understanding (Titanic Dataset)
• Mosaic plot
o Categorical distribution
o Visualizes the relation between X and Y
o Strong relation = Y-splits are far apart
o Conclusion: Women have higher survival rate
• Box plot
o Continuous distribution of a numeric variable
o IQR = middle 50%
o Identify outliers outside [Q1-1.5 IQR; Q3+1.5 IQR]
o Conclusion: Passengers who paid higher fares have a higher survival rate
• Scatter plot
o Shows how much one variable determines another
o Conclusion: Infants and men aged 25-45 have a higher survival rate
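A minimal plotting sketch for the box plot and scatter plot above, assuming pandas/matplotlib and the usual Kaggle Titanic column names (Survived, Fare, Age); titanic.csv is a placeholder path:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")     # placeholder copy of the Titanic dataset

# Box plot: Fare distribution per outcome; points outside [Q1-1.5 IQR; Q3+1.5 IQR] appear as outliers
df.boxplot(column="Fare", by="Survived")

# Scatter plot: Age vs. Fare, coloured by survival
df.plot.scatter(x="Age", y="Fare", c="Survived", colormap="coolwarm", alpha=0.6)
plt.show()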
Data Preprocessing
• Make features usable
o Numerical
o Categorical (i.e. week day)
o PCA dimensionality reduction
o Dummy variables
• Handle missing data
• Normalize data
o Bring numeric features to a standard range (i.e. from [-1000;1000] -> [0;1] or [-1;1])
o The value range influences the importance of a feature compared to the others
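A minimal preprocessing sketch with pandas/scikit-learn (library choice and column names are assumptions, not part of the deck): dummy variables, simple imputation, and scaling to [0;1]:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("titanic.csv")                           # placeholder dataset

# Dummy variables: turn a categorical column into 0/1 indicator columns
df = pd.get_dummies(df, columns=["Embarked"])

# Handle missing data: fill missing Age values with the median
df["Age"] = SimpleImputer(strategy="median").fit_transform(df[["Age"]])

# Normalize: rescale numeric columns to [0;1] so no feature dominates by magnitude alone
df[["Age", "Fare"]] = MinMaxScaler().fit_transform(df[["Age", "Fare"]])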
Feature Engineering, Feature Extraction
Increase predictive power by creating features on raw data
• Features closely related to target (predict default –> debt / balance ratio)
• Easier interpretation (Date to Year/Month/Day/Hour)
• Lag features to “look back” before the date (1, 2,… N days ago)
• Categorical features - identify discrete features
• Rolling aggregates – smoothing over a time window
• Check the Azure Team Data Science Process
https://docs.microsoft.com/en-gb/azure/machine-learning/team-data-science-process/create-features
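A minimal sketch of date decomposition, lag features and rolling aggregates with pandas (file and column names are illustrative):

import pandas as pd

ts = pd.read_csv("sales.csv", parse_dates=["date"]).set_index("date")   # placeholder time series

# Easier interpretation: split the date into parts
ts["year"], ts["month"], ts["weekday"] = ts.index.year, ts.index.month, ts.index.dayofweek

# Lag features: "look back" 1, 2 and 7 days before the current date
for lag in (1, 2, 7):
    ts[f"demand_lag_{lag}"] = ts["demand"].shift(lag)

# Rolling aggregate: smoothing over a 7-day window
ts["demand_roll_mean_7"] = ts["demand"].rolling(window=7).mean()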
Digital Media Feature Engineering
Note: All information is encoded in the digital media
• Images
o Step 1: Colour statistics, EXIF metadata, edges, shapes
o Step 2: Extract the knowledge into a fixed set of numeric characteristics
• Text
o Step 1:
• Bag-of-words, N-grams, term frequency, topic modelling, stemming
• Named entity recognition (i.e. Wikipedia)
o Step 2: Extract the knowledge into a fixed set of numeric characteristics
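A minimal sketch of the text case, assuming scikit-learn: term-frequency weighting over unigrams and bigrams turns raw strings into a fixed set of numeric characteristics:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the ship hit an iceberg", "the iceberg sank the ship"]   # toy corpus

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams, TF-IDF weighted
X = vectorizer.fit_transform(docs)                 # fixed-width numeric matrix: one row per document
print(X.shape)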
Feature Selection – select the most predictive features
For many ML problems, having a lot of data is a good thing; but it can sometimes be a curse.
Selecting Good Features
• Motivation
o Not only prediction but identification of predictive features
o Computational costs are related to number of features
o Limit external sensors and data sources
• Approach
o Trying all combinations of features? Infeasible – the number of subsets grows exponentially
• Methods
o Forward selection & Backward elimination
o Filter - Independent from the ML algorithm
o Embedded – Built-in search for predictive features in ML algorithm
o Wrapper – Measure feature usefulness while ML training
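A minimal sketch of a filter method and a wrapper-style (backward elimination) method, assuming scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently of the ML algorithm (ANOVA F-test), keep the top 10
X_filtered = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Wrapper / backward elimination: repeatedly train the model and drop the least useful feature
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
print("selected features:", int(rfe.support_.sum()))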
Tuning Model Parameters
• Model parameters control inner behaviour
o The more sophisticated the algorithm, the more parameters
o i.e. Locally Deep SVM with kernel:
o kernel type, kernel coefficient
• How does parameter tuning work? (see the sketch below)
1. Choose a metric for evaluation (AUC for classification, R2 for regression, etc.)
2. Select the parameters to optimize
3. Define a grid as the Cartesian product of the parameter value arrays
4. For each combination, cross-validate on the training set
5. Select the parameters with the best evaluation
Note: Expected improvement is 3%-8%
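A minimal sketch of the five steps, assuming scikit-learn's GridSearchCV (Azure ML Studio's Tune Model Hyperparameters module applies the same idea):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Steps 2-3: parameters to optimize and the grid (Cartesian product of the value arrays)
param_grid = {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# Steps 1 and 4: evaluation metric (AUC) and cross-validation for every combination
search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

# Step 5: parameters with the best evaluation
print(search.best_params_, round(search.best_score_, 3))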
Appropriate Algorithms Are Determined by the Data
Types of Algorithms
• Linear Algorithms
• Classification - classes separated by straight line
• Support Vector Machine – wide gap from line
• Regression – linear relation variables-label
• Non-Linear Algorithms
• Decision Trees and Jungles - divide space into regions
• Neural Networks – complex and irregular boundaries
• Special Algorithms
• Ordinal Regression – ranked values (i.e. race)
• Poisson Regression - discrete distribution (i.e. nr. of events)
• Bayesian – normal distribution of errors (bell curve)
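A minimal sketch contrasting a linear and a non-linear algorithm on data whose class boundary is curved (scikit-learn assumed):

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Two interleaving half-moons: the classes are not separable by a straight line
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)

# Linear algorithm: draws a straight boundary, struggles on this data
print("logistic regression:", cross_val_score(LogisticRegression(), X, y, cv=5).mean())

# Non-linear algorithm: divides the space into regions and follows the curved boundary
print("decision tree:", cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())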
False Alarms
False alarms have a serious impact:
• Degraded confidence in the system
• Loss of revenue
• Loss of brand image
Performance Metrics
• Regression model
o Root Mean Squared Error (RMSE)
o Coefficient of Determination, R² ∈ [0;1]
• Multi-class classification model
o Confusion matrix
• Binary classification model
o Accuracy based on correct answers
o Area under ROC curve (AUC)
o Threshold
o Precision = TP / (TP + FP)
o Recall = TP / (TP + FN)
o F1 score – harmonic mean of Precision and Recall
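A minimal sketch computing the binary-classification metrics above with scikit-learn (assumed tooling); the labels and scores are placeholders:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

y_true  = [0, 0, 0, 1, 1, 1, 1, 0]                    # placeholder ground truth
y_score = [0.1, 0.4, 0.6, 0.8, 0.3, 0.9, 0.7, 0.2]    # placeholder model scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]     # threshold applied to the scores

print(confusion_matrix(y_true, y_pred))               # rows: actual, columns: predicted
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1:", f1_score(y_true, y_pred))                # harmonic mean of precision and recall
print("AUC:", roc_auc_score(y_true, y_score))         # threshold-independent ranking quality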
Handling Imbalanced Data
• Imbalanced: more examples of one class than others (0.001%)
• Errors are not the same
o Prediction of minority class (failures) is more important
o Asymmetric cost (false negative can cost more than false positive)
• Compromised performance of standard ML algorithms
o With a 1% minority class, 99% accuracy does not mean a useful model
o The PR curve is better suited for imbalanced data
• Oversampling (see the SMOTE sketch below)
o SMOTE – allows better learning
o Generates synthetic examples by combining features of a minority sample with features of its neighbours
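A minimal oversampling sketch, assuming the third-party imbalanced-learn package (not part of the deck's toolchain):

from collections import Counter
from imblearn.over_sampling import SMOTE               # pip install imbalanced-learn
from sklearn.datasets import make_classification

# Synthetic dataset with a 1% minority class
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=0)
print("before:", Counter(y))

# SMOTE: new minority examples are interpolated between a sample and its nearest neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))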
Takeaways
• Team Data Science Process
o https://azure.microsoft.com/en-gb/documentation/learning-paths/data-science-process/
• ML in the Microsoft World
o https://docs.microsoft.com/en-us/azure/machine-learning/
• Python for AI
o https://wiki.python.org/moin/PythonForArtificialIntelligence
• Data Science Blog
o https://data-flair.training/blogs/category/machine-learning/
• Starter Books
o Free e-books download link:
https://www.manning.com/books/exploring-data-science