SQL Server Machine Learning Services is an embedded, predictive analytics and data science engine that can execute R and Python code within a SQL Server database as stored procedures, as T-SQL script containing R or Python statements, or as R or Python code containing T-SQL.
The key value proposition of Machine Learning Services is the power of its proprietary packages to deliver advanced analytics at scale, and the ability to bring calculations and processing to where the data resides, eliminating the need to pull data across the network.
25.
### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
hdfsFS
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay+DayOfWeek, data= AirlineDataSet, reportProgress=1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data= AirlineDataSet, means=T)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0 , data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
localFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf",
                            fileSystem = localFS)
Local parallel processing (Linux or Windows) / In Hadoop
Compute context R script: sets where the model will run
Functional model R script: does not need to change to run in Hadoop
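The split between a compute context script and a functional script can be sketched in plain Python. This is only an illustration of the pattern, not the RevoScaleR API: `local_context`, `parallel_context`, `set_compute_context`, and `count_late_arrivals` are hypothetical names, and the delay values are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Assumption: a "compute context" is modelled as a function that applies
# `work` to every data block and returns the per-block results.
ComputeContext = Callable[[Callable, Iterable], List]


def local_context(work: Callable, blocks: Iterable) -> List:
    """Run every block one after another in the current thread."""
    return [work(block) for block in blocks]


def parallel_context(work: Callable, blocks: Iterable) -> List:
    """Run blocks on a small thread pool (stands in for a remote cluster)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(work, blocks))


_active_context: ComputeContext = local_context


def set_compute_context(context: ComputeContext) -> None:
    """Compute context script: decides where later analysis calls run."""
    global _active_context
    _active_context = context


def count_late_arrivals(blocks: List[List[float]], threshold: float = 15.0) -> int:
    """Functional script: never mentions where it runs."""
    def late_in_block(block: List[float]) -> int:
        return sum(1 for delay in block if delay > threshold)

    return sum(_active_context(late_in_block, blocks))


# Illustrative arrival delays in minutes, already split into blocks.
delay_blocks = [[5.0, 20.0, 31.0], [0.0, 16.0], [45.0]]

set_compute_context(local_context)
late_local = count_late_arrivals(delay_blocks)

set_compute_context(parallel_context)
late_parallel = count_late_arrivals(delay_blocks)
```

Only the `set_compute_context` call changes between the two runs; `count_late_arrivals` is untouched, which is the property the slide attributes to the functional script.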
31. [Diagram: SQL Server external script architecture. Labels: sqlservr.exe (MSSQLSERVER service), sp_execute_external_script, SQLOS XEvent, named pipe carrying the “what” and “how” to “launch”, launchpad.exe (MSSQLLAUNCHPAD service; each SQL instance has a launchpad), “launcher”, sqlsatellite.dll, and several Windows “satellite” processes.]
Editor's Notes
Slide Objective:
Show the three pillars of Microsoft Advanced Analytics
Talking Points:
Microsoft’s Advanced Analytics products work with all your current investments: we support different platforms such as Windows, Linux, SQL Server, Teradata and even big data. They work both on premises and in the cloud.
Microsoft has long invested in innovative artificial intelligence technologies and built them into products like Cortana, HoloLens, Bing and Skype. We are now commercializing these technologies through our advanced analytics products, including Microsoft R.
Microsoft wants to help you accelerate the process of generating value from your data, which is why we are not only building the tools but also investing heavily in creating solutions that can help you drive value.
What does R mean in “R Services”? R is an open source programming language for statistical computing; Microsoft R Open is Microsoft’s distribution of it.
Last but not least, customers need flexibility in their choice of platform, programming languages and data infrastructure to get the most from their data.
Why? In most IT environments, platforms, technologies and skills are as diverse as they have ever been. The data platform of the future needs to let you build intelligent applications on any data, any platform and any language, on premises and in the cloud.
SQL Server manages your data, across platforms, with any skills, on-premises & cloud
Our goal is to meet you where you are: on any platform, anywhere, with the tools and languages of your choice.
SQL Server now supports Windows, Linux and Docker containers.
It allows you to leverage the language of your choice for advanced analytics – R & Python.
Slide objective
Show broad commitment to R by preserving freely available, enhanced editions, Windows and SQL Server editions and R Server editions for leading EDWs, Linux and Hadoop platforms.
Differentiate free, open editions from commercial by mentioning availability of commercial 24x7 support, and enhancements to support very large scale data analytics at speed.
Talking points
Notes
Slide objective
Illustrate the potential scale benefits possible with R Server’s ScaleR algorithms.
Show a representative example and explain the 3 mechanisms that help achieve the improvements.
Talking points
We tested the improved data and computational scale of the R Server’s ScaleR library of enhanced, parallelized algorithms. This is an example.
Speed:
On a 4 core laptop, with 8GB of RAM, open source R could process about 300,000 events in a particular data set prior to exhausting available memory.
The test took about 77 seconds to run the most commonly used R linear regression function, “lm”.
We then ran the same test using our parallelized linear regression function, rxLinMod, which is rewritten in C++.
Data Scale
Algorithms in the ScaleR library are also rewritten to analyze data in “chunks” to eliminate the memory-limits of typical open source R algorithms.
Where the open source lm exhausted memory at about 300,000 events, the improved rxLinMod was working fine at 5 million events where we stopped testing.
The result is a 50x performance improvement over open source linear regression, and no memory limits.
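The chunking idea can be illustrated with a minimal sketch in plain Python. It is not the ScaleR implementation; it only shows, for a one-predictor least-squares fit, how running sums accumulated chunk by chunk give the coefficients of a fit over the whole data set while holding a single chunk in memory at a time. The names `fit_in_chunks` and `synthetic_chunks` are hypothetical, and the data is synthetic.

```python
from typing import Iterable, Iterator, List, Tuple

Chunk = List[Tuple[float, float]]  # a block of (x, y) observations


def fit_in_chunks(chunks: Iterable[Chunk]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x, one chunk at a time.

    Only five running totals are kept, so memory use does not grow with
    the number of rows.
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_xy += x * y

    denominator = n * sum_xx - sum_x * sum_x
    if n < 2 or denominator == 0:
        raise ValueError("need at least two observations with distinct x values")

    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


def synthetic_chunks(n_rows: int, chunk_size: int) -> Iterator[Chunk]:
    """Lazily yield chunks of synthetic data that follow y = 3 + 2x exactly."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield [(float(i), 3.0 + 2.0 * i) for i in range(start, stop)]


intercept, slope = fit_in_chunks(synthetic_chunks(n_rows=1000, chunk_size=100))
# Fitting the same rows as one big chunk gives the same coefficients.
intercept_all, slope_all = fit_in_chunks(synthetic_chunks(n_rows=1000, chunk_size=1000))
```

Because the generator produces each chunk on demand, the full data set is never materialized, which is the property the notes above attribute to chunked algorithms.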
Parallel Scale
This example shows only the effects of running optimized, compiled code on all cores of a laptop. Greater benefits are available.
What is not shown is that the work done to parallelize across 4 cores can also be used to scale across many nodes in systems such as EDWs and Hadoop.
While results vary, the system, as you can see, responds linearly with respect to data size. Rehosting using R Server for Hadoop can provide even more dramatic speed and scale results.
Notes
- Wrangle data, experiment with models, and test models from a workstation
- Use your favorite IDE or notebook service
- Train models on big data, at speed, in parallel
- Transform large data sets using T-SQL, R, and Python
- Repeatedly score and rescore large data assets
- Embed R or Python in T-SQL
- Execute using T-SQL BI, reporting & app dev tools
- Embed R and Python within T-SQL scripts
- Makes R & Python callable from traditional applications
- Deploy smart apps using existing skills & tools
- Run trained models in real-time with low latency
- Detect anomalies at speed
Microsoft R Server is a broadly deployable, enterprise-class analytics platform based on R that is supported, scalable and secure. With a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics: exploration, analysis, visualization and modeling.
Slide objective
Introduce the high-level value of R Server and R Services over other instantiations of the R language.
Talking points
R Server products provide an enhanced experience for the R User without loss of compatibility
R Server products are “open core”: they use the open source R product in its entirety and build new capabilities around that core without impacting compatibility.
Users of R Server products enjoy full compatibility with open source R, including the entire (and vast) collection of algorithms, connectors and visualization tools shared openly via CRAN, Bioconductor and other shared resources like GitHub.
Key extensions enable R to tackle big data challenges that exceed the capacity of open source R.
R Scripts built for one platform using R Server can be easily run on another platform running R Server
We call it WODA – write once, deploy anywhere.
Two key contributions:
Build on any version of the product and deploy using other versions
Investment protection as platform choices change
Develop on the desktop and immediately deploy to RDBMS – SQL Server, EDW (SQL Server & Teradata) or Hadoop (Microsoft, Cloudera, Hortonworks and MapR)
Notes
Slide Objective
Present the range of already parallelized functions and algorithms available with RevoScaleR
Talking Points
This list shows the functions and algorithms that are available with all versions of R Server.
We call this the ScaleR Library.
Each function can:
Execute work in steps, in parallel or in serial as needed
Process work using multiple threads, cores, sockets or nodes
Process one or more data blocks in each thread, core, socket or node
Combine the results into a single mathematically correct answer
Do the work either locally or ship the request to another system for completion remotely.
Completely obscure the complexity of parallelization, multiple steps and iterations from the R programmer
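The block-and-combine behavior in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not how the ScaleR functions are implemented: `summarize_block`, `combine`, and `parallel_mean_variance` are hypothetical names, and a thread pool stands in for threads, cores, sockets or nodes (in CPython it shows the structure rather than a real speed-up).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Partial = Tuple[int, float, float]  # (count, sum, sum of squares) for one block


def summarize_block(block: Sequence[float]) -> Partial:
    """Work done independently on each data block."""
    return len(block), sum(block), sum(v * v for v in block)


def combine(partials: Sequence[Partial]) -> Tuple[float, float]:
    """Merge per-block partial results into one mean and sample variance."""
    n = sum(p[0] for p in partials)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


def parallel_mean_variance(
    values: List[float], block_size: int, workers: int = 4
) -> Tuple[float, float]:
    """Split the data into blocks, summarize them concurrently, then combine."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize_block, blocks))
    return combine(partials)


data = [float(v) for v in range(1, 101)]
mean, variance = parallel_mean_variance(data, block_size=10)
```

The caller sees a single function call and a single result; the splitting, the per-block work and the merge step stay hidden, and the answer does not depend on the block size chosen.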
Four frameworks (rxDataStep, rxExec, PEMA-R, and the newest, rxExecBy) let users write their own parallelized routines, functions and algorithms.
While more difficult than pre-written PEMAs, the results are portable – usable on multiple systems
Easier than writing directly to the platform to create custom algorithms.
Notes:
One algorithm framework, the PEMA-R API, is not available in clustered systems. It is an exception to portability across systems.
sp_execute_external_script is an example of a special proc, or specproc. The source code of the procedure can’t be found in the resource database; it is implemented in our source code.
Talk about other “extensible” environments we have used in the past
xproc
sp_OA
Linked servers
Full-text