This document provides an overview of Microsoft technologies for operational machine learning and data science. It introduces concepts of data science and machine learning, then discusses how to move from experimental to operational machine learning. It outlines various Microsoft technologies for advanced analytics including Azure Machine Learning, Microsoft R Server, SQL Server R Services, SQL Server Analysis Services, Azure Cognitive Services, and Spark ML on HDInsight. For each technology, it provides brief descriptions and examples of capabilities. The document aims to help users understand how to apply these Microsoft technologies for data science and machine learning projects.
Hello everyone and welcome to the last day of SQLBits…
My name is Khalid Salama. I work at Hitachi Consulting, in the Business Insights & Analytics practice, focusing on designing and delivering Data & Analytics Solutions
In this session, I would like to explore with you the various Microsoft technologies that can help you operationalize your Machine Learning pipelines and enable scalable data science.
Well, it’s more of an engineering session than a data science one, to be fair. However, I think it is an important topic to discuss, because
data science is often perceived as an experimental, isolated activity…
while in many contemporary applications, especially with the rise of digital transformation and IoT, your data science products need to be incorporated into your operational systems,
and your ML pipelines need to be an integral part of your ETL process.
So, we will try to touch on the various Microsoft options for performing both experimental data science and operational ML.
So without further ado, we have a lot of ground to cover…
I’ll start with a very quick intro to data science; I assume everybody here has “a” background in data science
Then, I’ll give some insights into the difference between exploratory data science and operational ML
After that, we are going to delve into the Microsoft technologies for Advanced Analytics and show several demos…
And finally, I will conclude with some general remarks.
So let’s get started: Data Science and Machine Learning
So Data Science, also known in academia as data mining, is the process of discovering interesting & useful patterns hidden in your data…
It’s an interdisciplinary field of computing, where concepts and techniques from different areas are employed, such as statistics, artificial intelligence, machine learning, and databases, as well as others like visualization, big data, cloud computing, et cetera…
The objective of data science is to provide you with actionable insights, that is, valuable findings that support decision making, for example, whether to invest in this new product line, or whether to perform a certain critical medical operation….
These findings may form a reliable model that is able, to a certain degree of confidence, to predict, estimate, or forecast a certain value or a future event, which can allow your business to better optimize its responsive actions… for instance, to perform stock optimization or service-package personalization
There is an enormous number of data science & ML applications in many business domains, of which some I had the opportunity to work on, including Customer Propensity modelling for campaigning and churn analysis, automatic risk detection in security operations, Social media & Customer Feedback analysis, demand sensing, and many others….
The principal data mining tasks are classification, regression, clustering, Association Rule Discovery (or frequent pattern mining), time series analysis, probabilistic modelling, similarity analysis, and collaborative filtering
Each focuses on tackling a particular analytics challenge
Some, of course, fall under the category of supervised learning, others are considered unsupervised learning techniques.
Now let’s take a look at the activities of a typical data science process,
to try to discriminate between experimental data science and operational machine learning
It starts with an exploratory data analysis phase…
After being presented with an analytics problem, you start with collecting the relevant data and importing it to your environment…
Then you blend this data by performing some generic data engineering tasks, such as merging, joining, aggregating, and so on…
After that, you apply some machine learning-specific data preparation tasks, also known as feature engineering, including feature construction, extraction, and selection, as well as feature tuning, like scaling and handling missing values & outliers.
The output of this phase is a learning dataset, which will be used in your ML experimentation phase.
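A minimal sketch of two such feature-tuning steps in Python — imputing missing values and standardizing a numeric feature. The column values and choices here are purely illustrative, not taken from any of the session’s demos:

```python
from statistics import mean, stdev

def impute_missing(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Scale a numeric feature to zero mean and unit (sample) variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# A hypothetical raw feature column with one missing value
raw = [10.0, 12.0, None, 14.0]
prepared = standardize(impute_missing(raw))  # ready for the learning dataset
```

In a real pipeline these steps would typically come from a library (e.g. a scaler in your ML framework of choice), but the fit of each transformation to the data is the same idea.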
In this phase, you perform iterative training & testing steps to select the algorithm & parameters that produce the model that best captures the hidden patterns in your data…
The final output of this whole experimentation phase is a report of findings, along with comprehensive visuals. That can be in the form of a markdown file, using Jupyter notebooks, that tells the end-to-end data analysis story and supports reproducibility.
These results may lead to a specific decision or recommendation.
In some scenarios, these results are the ultimate output of the data science activity
However, in many other scenarios, where you need repeated and real-time intelligence, such as targeted advertising and recommender systems, you need to productionize the models produced by the data science process, and integrate them with your operational systems to perform online predictions and recommendations
In that case, the whole ML pipeline, including data ingestion, processing, and model training and/or scoring, needs to be a repeatable, automated process
The process should produce a model that exposes a web API, to be integrated with your operational apps and consumed in real time
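To make the web-API idea concrete, here is a tiny Python sketch of what such a scoring endpoint does under the hood: parse a JSON request, score it with a trained model, and return a JSON response. The linear scorer and feature names are hypothetical stand-ins; the Microsoft technologies we cover next (Azure ML, Microsoft R Server) generate this plumbing for you:

```python
import json

def predict(features):
    """Stand-in for a trained model: a hypothetical linear scorer."""
    weights = [0.4, -0.2, 0.1]  # illustrative coefficients, not a real model
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

def handle_request(body: str) -> str:
    """The essence of a scoring web API: JSON in, prediction, JSON out."""
    features = json.loads(body)["features"]
    return json.dumps({"prediction": predict(features)})

response = handle_request('{"features": [1.0, 2.0, 3.0]}')
```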
Now let’s switch gears and talk about technology…
You have probably seen this diagram more than 100 times during the last couple of days
This is the Cortana Intelligence Suite, which provides you with a plethora of services to build end-to-end, batch and real-time, data analytics platforms…
You can visit the Cortana Intelligence Gallery online to find many templates for Analytics Solutions
However, we are going to focus only on the machine learning and intelligence parts of it…
Here we have the different Microsoft Technologies for data science & machine learning, which are:
Azure Machine Learning
Microsoft R Server & SQL Server (in-database) R Services
Analysis Services Data Mining
Spark Machine Learning on Azure HDInsight
Cognitive Features in Azure Data Lake Analytics
Azure Cognitive Services, &
Microsoft Bot framework
Does anyone know any other Microsoft tool or technology for Data Mining or Machine Learning?
So I can declare this an exhaustive list
Alright, we will try to touch on each of these - except the Bot Framework - discuss features and limitations, and show a simple demo for each.
So Let’s start with Azure Machine Learning…
Azure ML has been around for a while now, and has gained a lot of popularity in both experimental data science and operationalizing ML models…
It is a cloud-based PaaS service
Provides an interactive (drag and drop) Data Science studio to perform your experiments
With a lot of built-in functionality for data processing and modelling
You can import data from different sources
However, the most interesting feature in Azure ML is that you can easily productionize your ML models as web services, so that you can integrate them into your ETL for training and batch scoring,
or into your operational applications for real-time predictions
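As a sketch, calling a published Azure ML (classic studio) request-response web service is just an HTTP POST with a JSON body; the schema below follows the classic studio request format, but the endpoint URL, API key, column names, and values here are all placeholders:

```python
import json

def build_scoring_request(column_names, rows):
    """Build the JSON body for an Azure ML classic request-response service.

    Each inner list in `rows` is one record to score.
    """
    return json.dumps({
        "Inputs": {
            "input1": {
                "ColumnNames": column_names,
                "Values": rows,
            }
        },
        "GlobalParameters": {},
    })

body = build_scoring_request(["age", "income"], [[42, 55000]])
# The actual call would POST `body` to the service URL with headers:
#   Authorization: Bearer <your API key>
#   Content-Type: application/json
```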
You can also extend your ML experiments using python and R scripts
Let’s quickly switch to the Azure portal and have a look at it
Microsoft R Server
Probably the most important analytics product for Microsoft at the moment….
If you are an R developer, you will probably know that open-source R has scalability limitations, because it is single-threaded and in-memory only…
You needed to use commercial R libraries to make your program multi-threaded, process your data partly in-memory and partly on-disk (so that you can handle data sizes bigger than your workstation’s memory), and run your R app on a cluster for distributed computing, scaling out your data processing…
Well, Microsoft acquired a company that builds such libraries, called Revolution Analytics, and included their open-source libraries in Microsoft R Open (MRO), and their commercial ones in Microsoft R Server (MRS)
Besides, Microsoft R Open has an enhanced Math Kernel Library, for more efficient mathematical computations,
and it is compatible with all R-related software
So in this comparison, you can see that Microsoft R Server processes the data in-memory as well as on-disk (using external data frames, or XDF files, which we will see shortly);
it is multi-threaded, and supports distributed computing to scale for big data processing
Let’s take a closer look at the main components of Microsoft R Server
ScaleR – the core libraries in Microsoft R Server, optimized for parallel execution, using external data frames to overcome the memory limitation
ConnectR – provides access to various data sources, including distributed file systems and relational databases
DistributeR – allows your R application to run in different execution contexts, including distributed ones
So you can write your application once, and with a few lines of code, you can configure it to run on different execution contexts in order to scale it
Microsoft R Server Operationalization – allows you to deploy your R models, on a configured R Server, as web APIs (similar to what we have seen in Azure ML) using the mrsdeploy library
Let’s have a look at some sample Microsoft R code
SQL Server R Services…
Supporting the execution of R scripts within SQL Server has opened interesting opportunities to bring your analytics close to your data, and integrate your Machine Learning process with your SQL-based ETL.
Now you can encapsulate your ML tasks, as R scripts, in a T-SQL stored procedure, to perform model training or batch scoring
Then you can call this proc from an SSIS package
As for a training pipeline, you use data in SQL Server tables to train an R model, then you serialize this model and store and maintain it in a table
Similarly, for a prediction pipeline, you load the R model that is stored in the table, perform prediction using data in a SQL table, and save the results back
This is all using sp_execute_external_script
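The train-serialize-store / load-deserialize-score pattern is language-agnostic; here it is sketched in Python with `pickle` standing in for R’s `serialize`, and a dictionary standing in for the SQL table. The `MeanModel` “model” and table name are hypothetical:

```python
import pickle

# In-memory stand-in for the SQL table that holds serialized models
model_table = {}

class MeanModel:
    """A trivial hypothetical 'model': predicts the training mean."""
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self
    def predict(self, n):
        return [self.mean_] * n

# Training pipeline: train a model, serialize it, store it in the 'table'
model_table["sales_model"] = pickle.dumps(MeanModel().fit([10, 20, 30]))

# Prediction pipeline: load from the 'table', deserialize, score new data
model = pickle.loads(model_table["sales_model"])
predictions = model.predict(3)
```

In the SQL Server R Services version, the serialized bytes live in a `VARBINARY(MAX)` column and both halves run inside stored procedures, so SSIS can call them like any other proc.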
Another advantage is that your ML scripts are managed and source controlled as part of your SQL Database project in Visual Studio
In terms of limitations, R Services are not supported in Azure SQL DB/DW, yet
It is not suitable for Interactive Data Science, and it’s a bit hard to debug.
It also supports R only, which is not ideal for Python Data Scientists. However, I think we will see SQL Server in-database Python Services in the future…
Let’s have a look at how this works
Well, I can’t talk about Microsoft & Machine Learning without mentioning my old friend Analysis Services Data Mining….
I’ve personally delivered a couple of interesting Data Mining Solutions using this technology
It has been around for more than 10 years, since SQL Server 2005, yet no one is talking about it now…
Although SQL Server Analysis Services has limited Data Mining features, as well as very limited extensibility (that is, you need to write your own ML algorithms and integrate them using C++ only),
I think it has a number of useful features that make it a good candidate for delivering and productionizing ML solutions
It can process data from various OLEDB and ODBC data sources, including Azure SQL DB and DW,
It is very easy to build and deploy your data mining models in an Analysis Services database and use the model for batch prediction using DMX
It integrates seamlessly with SSIS, which includes special tasks to perform training and prediction queries
However, the most interesting thing about Analysis Services is that it provides a very useful interactive console to explore and interpret the constructed models
It has an Excel Add-in to allow non-technical users to do Data Mining
In Analysis Services Data Mining, you build your mining structures based on the tables in the data source view.
Then you run an algorithm on the data in the mining structure to produce a mining model
The algorithms available are:…
With a wide range of parameter configurations
If you want your system to perform generic intelligence tasks, such as text analysis or image recognition, you should probably consider Azure Cognitive Services, before you build your own models..
You have a number of REST APIs that perform various analytics, including language, speech, vision, and search, which can be consumed directly from your operational app…
Very good documentation is supplied for each API
You can create a cognitive service in your Azure portal by specifying the API type and service level (which affects how much you will pay), and look at the documentation to learn how to consume a specific service
Let’s have a look at a demo that uses the Emotion API
While the Azure Cognitive Services APIs are used in your operational apps for real-time predictions, Data Lake Analytics provides cognitive features to be integrated into your batch big data processing pipeline
This includes image and text analytics
So you can benefit from the scalability of U-SQL jobs to process large amounts of data, as well as the orchestration of Azure Data Factory to streamline your process
In addition, Data Lake Analytics supports Python extensions in U-SQL scripts; let’s have a look
So basically, you can write some Python functions to process your data as part of the U-SQL job
Now let’s see a Text Analytics demo using Data Lake Analytics Cognitive Features
If you are a Spark developer, you will probably know that Spark has rich ML libraries to perform various data processing and model learning tasks
It is scalable, and extensible (you can implement your own algorithms using PySpark, SparkR, Java, or Scala)…
Typically suitable for model training and batch scoring…
And as it is available on HDInsight, your Spark data processing and ML pipelines can be automated and scheduled using Azure Data Factory
Here is the link for the Data Factory Template for this…
Spark has standard APIs for ML,
to make it easier to combine multiple tasks, including pre-processing and modelling,
into a single workflow, or pipeline
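The pipeline idea can be sketched in plain Python. Spark’s real API is `pyspark.ml.Pipeline`, where stages operate on DataFrames; this is just the fit/transform pattern it implements, with two hypothetical stages:

```python
class Scaler:
    """Pre-processing stage: learns min/max, then rescales to [0, 1]."""
    def fit(self, data):
        self.lo, self.hi = min(data), max(data)
        return self
    def transform(self, data):
        return [(x - self.lo) / (self.hi - self.lo) for x in data]

class Thresholder:
    """Modelling stage: labels values above the learned mean as 1."""
    def fit(self, data):
        self.cut = sum(data) / len(data)
        return self
    def transform(self, data):
        return [1 if x > self.cut else 0 for x in data]

class Pipeline:
    """Chains stages so pre-processing and modelling run as one workflow."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

labels = Pipeline([Scaler(), Thresholder()]).fit_transform([2, 4, 6, 8])
```

The payoff is the same as in Spark: the fitted pipeline is a single object you can apply to new data, persist, and schedule as one unit in your ETL.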