DECEMBER 15
GLOBAL AI BOOTCAMP IS POWERED BY:
The Data Science Process in ML
How to Apply It and When Do We Need It?
Thanks to our Sponsors:
Global Sponsor:
Venue Sponsor:
About me
• Software Architect @
o 16+ years professional experience
• Microsoft Azure MVP
• External Expert Horizon 2020
• External Expert Eurostars-Eureka, InnoFund Denmark
• Business Interests
o Web Development, SOA, Integration
o IoT, Machine Learning, Computer Intelligence
o Security & Performance Optimization
• Contact
ivelin.andreev@icb.bg
www.linkedin.com/in/ivelin
www.slideshare.net/ivoandreev
AGENDA
Major Tools
The Purpose of ML
AI as a Service
Iterative ML Process
Takeaways
Demo
Machine Learning and Microsoft
• Azure ML integrated, end-to-end data science and advanced analytics
• Microsoft ML related services/tools
• Highlights
o Built on open source technologies (Jupyter Notebook, Spark, Python, Docker)
o Execute experiments in isolated environments and GPU-enabled VMs
DEPRECATED (consolidated into “Machine Learning Service”, preview):
o Azure ML Workbench
o Azure ML Experimentation Service
o Azure ML Model Management Service
MAINTAINED AND IMPROVED:
o Azure ML Studio
o Data Science VM
o Azure Databricks
o Cognitive Toolkit (CNTK)
o Azure Batch AI Training
o Visual Studio Code Tools for AI
o Microsoft Cognitive Services, LUIS.ai
o Libraries for Apache Spark (MMLSpark)
o ML Services for SQL Server (R, Python)
Azure ML Workbench
Desktop application (Windows, macOS) with
• Built-in Jupyter Notebook services and Git integration
• End-to-end process support
o Model development and experimentation (Python)
o Powerful inspectors for data analysis
o Data transformations by example
o Model history and deployment
• Easy to use, but resource hungry
* Replaced in the Sept 24, 2018 release to make way for an improved architecture
(ref. to Azure ML SDK for Python or Azure Databricks for big datasets)
Azure ML Studio
• Visual workspace to build, test and deploy ML solutions
• Highlights
o Cross-browser drag and drop, no programming
o Rich set of modules
o Fits beginners and advanced users
o Unlimited extensibility (R Script, Python Script)
o Enterprise grade cloud service (SLA 99.95%)
o ML REST web services consumption
o Jupyter Notebook
o Azure AI Gallery (9000+ samples)
• At what price?
o Free plan available (10GB storage, 2 web services, 1000 requests/month)
o $10 seat/month + $1 experiment/hour
Azure Data Science VM
• Pre-configured cloud environment for AI & Data Science
• Highlights
o Fully operational environment
o 50+ tools DEV, ML, BigData, Data management
o Windows and Linux (Ubuntu/CentOS)
o Updated every few months
o On-demand elastic capacity
o GPU optimized VMs for deep learning
o Up to 4x NVIDIA K80 or V100 GPUs
o Up to 128 vCPU, up to 6,144 GiB RAM
• At what price?
o From $11.76/month to $14,314/month
Azure ML Service (preview)
• Cloud-based environment to develop, train, test, deploy, manage, and track ML models
• Highlights
o Model management
o Distributed deep learning
o Version control and reproducibility
o Hybrid deployment (Local, Cloud, Edge)
o Automated ML (data prep, algorithm, parameters)
o Latest open source technologies (TensorFlow, PyTorch, Jupyter, Docker)
o Scale up or out with large GPU-enabled clusters in the cloud
• At what price?
o From $23.51/month to $29,143.94/month
The purpose of ML modelling is:
• Generate predictions
• Understand true relations
Machine Learning Challenges
• Asking the right questions
• Typically 1 Model = 1 Question
• Requires training data
o Real-world data is messy (wrong or missing data)
o Feature engineering transforms to predictive features
o Feature extraction ( i.e. IP Address -> population density)
o Feature selection for informative features
• Overfitting model
o “Kicks ass” while training,
o but fails badly on real predictions
• Model validation
o “Sense” how well the model works on new data (see the cross-validation sketch below)
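A minimal validation sketch (scikit-learn is assumed here as the tooling; any ML library with hold-out and cross-validation support works the same way):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold-out split: train on one part, score on data the model never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: repeat the split several times for a more stable estimate
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())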
Users’ expectations:
• Engaging experience
• Effortless interaction
• High performance
• Relevant content
Businesses aim to:
• Provide high value
• Deliver faster and at lower cost
o Data science talent
o Powerful infrastructure
o Continuous improvement
The developer’s role is to bridge the gap.
Artificial Intelligence as a Service (AIaaS)
Def: Artificial intelligence off the shelf
• Bots and NLP – commands and guidance
• Cognitive APIs – speech, vision, translation, knowledge
• ML frameworks – build own model w/o infrastructure (i.e. Azure ML Service)
• Fully managed ML – templates, deployment, drag-drop (i.e. Azure ML Studio)
Pros:
• Innovation w/o upfront costs and expertise
• Usability – easy learning curve
• Scalability – start PoC, grow big
• Flexi cost – know what you pay for
Cons:
• Sharing data with vendors
• Data regulations (i.e. GDPR)
• Reduced transparency
• Breaking changes
The AIaaS market is expected to grow from $1.5B (2018) to $10.9B (2023)
(ResearchAndMarkets, Apr 2018)
A year ago this was not as achievable as it is now.
Some Key Azure AIaaS
Computer Vision
• Advanced algorithms that extract information from images
Face API
• Detect and analyze facial attributes
Custom Vision API
• Build, deploy, improve custom image classifiers (on tags)
LUIS.ai
• Apply custom ML intelligence to conversational natural language
Custom Decision (experimental)
• Learn behavioural patterns of users
The Data Scientist Job
• Appealing
o 64% believe they are working in this century’s “sexiest” job
• In demand
o 90% are contacted at least once a month with a job offer
o 50% weekly, 30% several times a week; 35% have <2y experience
• The dark side…
o All models are wrong, some are useful
o 80% of the time is data preparation
o Real-life, not academic problems
o Non-linear hypothesis testing
o No full automation
• No one cares how you do it
Automated ML (AML)
AML is a recommender system for ML pipelines, aiming to reach accuracy in less time
• Problem: Complexity scales faster than the time available
• Highlights
o Designed to not look at customer data
o Only each pipeline result is sent to the automated ML service
o Data pre-processing, algorithm experimentation, hyperparameter tuning
• How it Works
o Select the task type: classification (11 algorithms), regression (9), forecasting (9)
o Specify labeled data source and format (Numpy array, Pandas dataframe)
o Configure target for training (local, remote VM, AML Compute)
o Set AML configuration
from azureml.train.automl import AutoMLConfig

automl_classifier = AutoMLConfig(
    task='classification',           # classification / regression / forecasting
    primary_metric='AUC_weighted',   # metric the sweep optimizes
    max_time_sec=12000,              # time budget for the experiment
    iterations=50,                   # number of pipelines to try
    X=F_Train,                       # feature matrix (Numpy array / Pandas dataframe)
    y=F_Label,                       # label vector
    n_cross_validations=2)
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train
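A minimal sketch of how such a configuration could be submitted (assumptions: an existing Azure ML Workspace with a local config.json, and an experiment name chosen here only for illustration):

from azureml.core import Workspace
from azureml.core.experiment import Experiment

ws = Workspace.from_config()                    # loads workspace details from config.json
experiment = Experiment(ws, "automl-demo")      # hypothetical experiment name
run = experiment.submit(automl_classifier, show_output=True)   # runs the AutoML sweep
best_run, fitted_model = run.get_output()       # best pipeline found by the recommender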
Iterative ML Process
Data Understanding (Titanic Dataset)
• Mosaic plot
o Categorical distribution
o Visualizes the relation between X and Y
o Strong relation = Y-splits are far apart
o Conclusion: Women have higher survival rate
• Box plot
o Continuous distribution of a numeric variable
o IQR = middle 50%
o Identify outliers outside [Q1-1.5 IQR; Q3+1.5 IQR]
o Conclusion: Passengers who paid higher fares have a higher survival rate
• Scatter plot
o Shows how much one variable determines another
o Conclusion: Infants and men aged 25-45 have a higher survival rate
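A minimal plotting sketch for the box plot and scatter plot above, assuming pandas/matplotlib and the usual Kaggle Titanic column names (Survived, Fare, Age); titanic.csv is a placeholder path:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")     # placeholder copy of the Titanic dataset

# Box plot: Fare distribution per outcome; points outside [Q1-1.5 IQR; Q3+1.5 IQR] appear as outliers
df.boxplot(column="Fare", by="Survived")

# Scatter plot: Age vs. Fare, coloured by survival
df.plot.scatter(x="Age", y="Fare", c="Survived", colormap="coolwarm", alpha=0.6)
plt.show()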
Data Preprocessing
• Make features usable
o Numerical
o Categorical (i.e. week day)
o PCA dimensionality reduction
o Dummy variables
• Handle missing data
• Normalize data
o Bring numeric features to a standard range (i.e. from [-1000;1000] -> [0;1] or [-1;1])
o The value range influences the importance of a feature compared to the others
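A minimal preprocessing sketch with pandas/scikit-learn (library choice and column names are assumptions, not part of the deck): dummy variables, simple imputation, and scaling to [0;1]:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("titanic.csv")                           # placeholder dataset

# Dummy variables: turn a categorical column into 0/1 indicator columns
df = pd.get_dummies(df, columns=["Embarked"])

# Handle missing data: fill missing Age values with the median
df["Age"] = SimpleImputer(strategy="median").fit_transform(df[["Age"]])

# Normalize: rescale numeric columns to [0;1] so no feature dominates by magnitude alone
df[["Age", "Fare"]] = MinMaxScaler().fit_transform(df[["Age", "Fare"]])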
Feature Engineering, Feature Extraction
Increase predictive power by creating features on raw data
• Features closely related to target (predict default –> debt / balance ratio)
• Easier interpretation (Date to Year/Month/Day/Hour)
• Lag features to “look back” before the date (1, 2,… N days ago)
• Categorical features - identify discrete features
• Rolling aggregates – smoothing over a time window
• Check the Azure Team Data Science Process
https://docs.microsoft.com/en-gb/azure/machine-learning/team-data-science-process/create-features
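A minimal sketch of date decomposition, lag features and rolling aggregates with pandas (file and column names are illustrative):

import pandas as pd

ts = pd.read_csv("sales.csv", parse_dates=["date"]).set_index("date")   # placeholder time series

# Easier interpretation: split the date into parts
ts["year"], ts["month"], ts["weekday"] = ts.index.year, ts.index.month, ts.index.dayofweek

# Lag features: "look back" 1, 2 and 7 days before the current date
for lag in (1, 2, 7):
    ts[f"demand_lag_{lag}"] = ts["demand"].shift(lag)

# Rolling aggregate: smoothing over a 7-day window
ts["demand_roll_mean_7"] = ts["demand"].rolling(window=7).mean()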
Digital Media Feature Engineering
Note: All information is encoded in the digital media
• Images
o Step 1: Colour statistics, EXIF metadata, edges, shapes
o Step 2: Extract the knowledge into a fixed set of numeric characteristics
• Text
o Step 1:
• Bag-of-words, N-grams, term frequency, topic modelling, stemming
• Named entity recognition (i.e. Wikipedia)
o Step 2: Extract the knowledge into a fixed set of numeric characteristics
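A minimal sketch of the text case, assuming scikit-learn: term-frequency weighting over unigrams and bigrams turns raw strings into a fixed set of numeric characteristics:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the ship hit an iceberg", "the iceberg sank the ship"]   # toy corpus

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams, TF-IDF weighted
X = vectorizer.fit_transform(docs)                 # fixed-width numeric matrix: one row per document
print(X.shape)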
Feature Selection – select the most predictive features
For many ML problems, having a lot of data is a good thing; but it can sometimes be a curse.
Selecting Good Features
• Motivation
o Not only prediction but identification of predictive features
o Computational costs are related to number of features
o Limit external sensors and data sources
• Approach
o Trying all combinations of features? Infeasible – the number of subsets grows exponentially
• Methods
o Forward selection & Backward elimination
o Filter - Independent from the ML algorithm
o Embedded – Built-in search for predictive features in ML algorithm
o Wrapper – Measure feature usefulness while ML training
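A minimal sketch of a filter method and a wrapper-style (backward elimination) method, assuming scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently of the ML algorithm (ANOVA F-test), keep the top 10
X_filtered = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Wrapper / backward elimination: repeatedly train the model and drop the least useful feature
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
print("selected features:", int(rfe.support_.sum()))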
Tuning Model Parameters
• Model parameters control inner behaviour
o The more sophisticated the algorithm, the more parameters
o i.e. Locally Deep SVM with kernel:
o kernel type, kernel coefficient
• How does parameter tuning work? (see the sketch below)
1. Choose a metric for evaluation (AUC for classification, R2 for regression, etc.)
2. Select the parameters to optimize
3. Define a grid as the Cartesian product of the parameter value arrays
4. For each combination, cross-validate on the training set
5. Select the parameters with the best evaluation
Note: Expected improvement is 3%-8%
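A minimal sketch of the five steps, assuming scikit-learn's GridSearchCV (Azure ML Studio's Tune Model Hyperparameters module applies the same idea):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Steps 2-3: parameters to optimize and the grid (Cartesian product of the value arrays)
param_grid = {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# Steps 1 and 4: evaluation metric (AUC) and cross-validation for every combination
search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

# Step 5: parameters with the best evaluation
print(search.best_params_, round(search.best_score_, 3))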
Appropriate Algorithms Are Determined by the Data
Types of Algorithms
• Linear Algorithms
• Classification - classes separated by straight line
• Support Vector Machine – wide gap from line
• Regression – linear relation variables-label
• Non-Linear Algorithms
• Decision Trees and Jungles - divide space into regions
• Neural Networks – complex and irregular boundaries
• Special Algorithms
• Ordinal Regression – ranked values (i.e. race)
• Poisson Regression - discrete distribution (i.e. nr. of events)
• Bayesian – normal distribution of errors (bell curve)
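A minimal sketch contrasting a linear and a non-linear algorithm on data whose class boundary is curved (scikit-learn assumed):

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Two interleaving half-moons: the classes are not separable by a straight line
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)

# Linear algorithm: draws a straight boundary, struggles on this data
print("logistic regression:", cross_val_score(LogisticRegression(), X, y, cv=5).mean())

# Non-linear algorithm: divides the space into regions and follows the curved boundary
print("decision tree:", cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())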
False Alarms
False alarms have a serious impact:
• Degraded confidence in the system
• Loss of revenue
• Loss of brand image
Performance Metrics
• Regression model
o Root Mean Squared Error (RMSE)
o Coefficient of Determination, R² ∈ [0;1]
• Multi-class classification model
o Confusion matrix
• Binary classification model
o Accuracy based on correct answers
o Area under ROC curve (AUC)
o Threshold
o Precision = TP / (TP + FP)
o Recall = TP / (TP + FN)
o F1 score – harmonic mean of Precision and Recall
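A minimal sketch computing the binary-classification metrics above with scikit-learn (assumed tooling); the labels and scores are placeholders:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

y_true  = [0, 0, 0, 1, 1, 1, 1, 0]                    # placeholder ground truth
y_score = [0.1, 0.4, 0.6, 0.8, 0.3, 0.9, 0.7, 0.2]    # placeholder model scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]     # threshold applied to the scores

print(confusion_matrix(y_true, y_pred))               # rows: actual, columns: predicted
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1:", f1_score(y_true, y_pred))                # harmonic mean of precision and recall
print("AUC:", roc_auc_score(y_true, y_score))         # threshold-independent ranking quality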
Handling Imbalanced Data
• Imbalanced: more examples of one class than others (0.001%)
• Errors are not the same
o Prediction of minority class (failures) is more important
o Asymmetric cost (false negative can cost more than false positive)
• Compromised performance of standard ML algorithms
o With a 1% minority class, 99% accuracy does not mean a useful model
o The PR curve is better suited for imbalanced data
• Oversampling (see the SMOTE sketch below)
o SMOTE – allows better learning
o Generates synthetic examples by combining features of a minority sample with features of its neighbours
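A minimal oversampling sketch, assuming the third-party imbalanced-learn package (not part of the deck's toolchain):

from collections import Counter
from imblearn.over_sampling import SMOTE               # pip install imbalanced-learn
from sklearn.datasets import make_classification

# Synthetic dataset with a 1% minority class
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=0)
print("before:", Counter(y))

# SMOTE: new minority examples are interpolated between a sample and its nearest neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))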
Takeaways
• Team Data Science Process
o https://azure.microsoft.com/en-gb/documentation/learning-paths/data-science-process/
• ML in the Microsoft World
o https://docs.microsoft.com/en-us/azure/machine-learning/
• Python for AI
o https://wiki.python.org/moin/PythonForArtificialIntelligence
• Data Science Blog
o https://data-flair.training/blogs/category/machine-learning/
• Starter Books
o Free e-books download link:
https://www.manning.com/books/exploring-data-science