Scalable tabular (SFrame, SArray) and graph (SGraph) data-structures built for out-of-core data analysis.
The SFrame package provides the complete implementation of:
SFrame
SArray
SGraph
The C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph)
ESIP 2018 - The Case for Archives of Convenience - Dan Pilone
Earth Science data is measured in petabytes and represents decades of data collection, evolution of technology and practices, and provides an unparalleled view of our planet. The pace of change is only accelerating: NASA and other agencies are on their way to making hundreds of Petabytes of data available in the cloud, highly scalable processing and analysis architectures and tools are in active use with more being developed every day, and each of these brings with it opportunities for optimization and innovation. This talk demonstrates leveraging the elastic nature of the cloud using GOES-16 data to create ephemeral Archives of Convenience, targeting individual researcher needs, optimized for their problems and tool suites, instead of trying to settle on a single "cloud optimized" solution.
Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW... - Gary Stafford
We will read and write messages to and from Amazon MSK in Apache Avro format. We will store the Avro-format Kafka message’s key and value schemas in Apicurio Registry and retrieve the schemas instead of hard-coding the schemas in the PySpark scripts. We will also use the registry to store schemas for CSV-format data files.
Link to the blog post and video: https://itnext.io/stream-processing-with-apache-spark-kafka-avro-and-apicurio-registry-on-amazon-emr-and-amazon-13080defa3be
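The registry-lookup pattern described above can be sketched in plain Python. The registry contents, the artifact IDs (`topic1-key`, `topic1-value`), and the `Sensor` record are illustrative stand-ins; a real job would fetch schemas over Apicurio Registry's REST API rather than from an in-memory dict.

```python
import json

# Minimal stand-in for a schema registry: artifact ID -> Avro schema JSON.
# Names and schemas below are hypothetical, for illustration only.
REGISTRY = {
    "topic1-key": json.dumps({"type": "string"}),
    "topic1-value": json.dumps({
        "type": "record", "name": "Sensor",
        "fields": [{"name": "id", "type": "string"},
                   {"name": "temp", "type": "double"}],
    }),
}

def fetch_schema(artifact_id):
    """Return the parsed Avro schema for an artifact, the way the PySpark
    job would look it up instead of hard-coding the schema inline."""
    return json.loads(REGISTRY[artifact_id])

value_schema = fetch_schema("topic1-value")
field_names = [f["name"] for f in value_schema["fields"]]
```

The point is that the script only needs the artifact ID; the schema itself lives in one governed place and can evolve without code changes.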
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R - Databricks
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
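As a hedged sketch of the first bottleneck named above: hyperparameter tuning is embarrassingly parallel, so each grid point can become an independent task (with Spark, one task per parameter combination). The toy objective `train_and_score` below is a hypothetical stand-in for a real train-and-validate cycle, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_score(reg, depth):
    # Hypothetical objective standing in for training a model with the
    # given regularization and depth, then scoring it on validation data.
    return -((reg - 0.1) ** 2 + (depth - 3) ** 2)

# Evaluate every grid point concurrently; each point is independent,
# which is exactly why this stage distributes so easily.
grid = list(product([0.01, 0.1, 1.0], [2, 3, 4]))
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(lambda p: train_and_score(*p), grid))

best = grid[max(range(len(grid)), key=scores.__getitem__)]  # (0.1, 3)
```

Swapping the thread pool for Spark's parallelism changes the executor, not the structure of the search.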
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16 - MLconf
Smarter Search With Spark-Solr: Search gets smarter when you know more about your documents and their relationship to each other (think: PageRank) and the users (i.e. popularity), in addition to what you already know about their content (text search). It also gets smarter when you know more about your users (personalization) and both their affinity for certain kinds of content and their similarities to each other (collaborative filtering recommenders).
Building all of these pieces typically requires a big mix of batch workloads to do log processing, as well as training machine-learned models to use during realtime querying, and are highly domain specific, but many techniques are fairly universal: we will discuss how Spark can interface with a Solr Cloud cluster to efficiently perform many of the pieces to this puzzle in one relatively self-contained package (no HDFS/S3, all data stored in Solr!), and introduce “spark-solr” – an open-source JVM library to facilitate this.
H2O Rains with Databricks Cloud - NY 02.16.16 - Sri Ambati
Michal Malohlava's presentation on H2O Rains with Databricks Cloud, New York, NY 02.16.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 - MLconf
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
SAREF in the InterConnect project - ICTOpen 2022 - RonaldSiebes2
Presenting Machine Learning research using SAREF in the InterConnect project at the "Smart Cities, Health and AI " track at ICTOpen 2022 (https://www.ictopen.nl/tracks/track-smart-cities-health-and-ai/)
Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries... - Databricks
Data production continues to scale up and the techniques for managing it need to scale too. Building pipelines that can process petabytes per day in turn create data lakes with exabytes of historical data. At Databricks, we help our customers turn these data lakes into gold mines of valuable information using Apache Spark. This talk will cover techniques to optimize access to these data lakes using Delta Lakes, including range partitioning, file-based data skipping, multi-dimensional clustering, and read-optimized files. We'll cover sample implementations and see examples of querying petabytes of data in seconds, not hours. We'll also discuss tradeoffs that data engineers deal with every day like read speed vs. write throughput, managing storage costs, and duplicating data to support multiple query profiles. We'll also discuss combining batch with streaming to achieve desired query performance. After this session, you will have new ideas for managing truly massive Delta Lakes.
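File-based data skipping, one of the techniques listed above, can be illustrated with a small sketch: each file carries min/max statistics for a column (as Delta Lake records in its transaction log), and the planner scans only files whose range overlaps the query predicate. The file names and date ranges below are made up for illustration.

```python
# Hypothetical per-file statistics, as a planner would read them from the log.
FILES = [
    {"path": "part-0.parquet", "min_date": "2019-01-01", "max_date": "2019-03-31"},
    {"path": "part-1.parquet", "min_date": "2019-04-01", "max_date": "2019-06-30"},
    {"path": "part-2.parquet", "min_date": "2019-07-01", "max_date": "2019-09-30"},
]

def files_to_scan(lo, hi, files=FILES):
    """Keep only files whose [min, max] range overlaps the query range;
    ISO dates compare correctly as strings."""
    return [f["path"] for f in files
            if f["max_date"] >= lo and f["min_date"] <= hi]

# A query for May 2019 touches one file instead of three.
to_scan = files_to_scan("2019-05-01", "2019-05-31")
```

The same interval-overlap test, applied over thousands of files, is what turns a full-table scan into a handful of reads.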
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict... - Databricks
The prevailing issue when working with Operating Room (OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty, unused operating rooms and longer waiting times for patients awaiting their procedures. In this three-part session, Ayad Shammout and Denny will show:
1) How we tried to solve this problem using traditional DW techniques
2) How we took advantage of the DW capabilities in Apache Spark and easily transitioned to Spark MLlib, so we could more easily predict available OR block times, resulting in better OR utilization and shorter wait times for patients.
3) Some of the key learnings we had when migrating from DW to Spark.
In this session you will learn how H&M has created a reference architecture for deploying their machine learning models on Azure, utilizing Databricks and following DevOps principles. The architecture is currently used in production and has been iterated over multiple times to solve some of the discovered pain points. The presenting team is currently responsible for ensuring that best practices are implemented on all H&M use cases, covering hundreds of models across the entire H&M group. This architecture not only lets data scientists use notebooks for exploration and modeling, but also gives engineers a way to build robust, production-grade code for deployment. The session will in addition cover topics like lifecycle management, traceability, automation, scalability, and version control.
Data science in Ruby: is it possible? is it fast? should we use it? - Rodrigo Urubatan
Slides used in my presentation at http://thedevelopersconference.com.br in the #ruby track this year in São Paulo, talking a little about data science, the alternatives for doing it in Ruby, how to integrate Ruby and Python, and the best solutions available.
A MAC URISA event. This talk is oriented to GIS users looking to learn more about the Python programming language. The Python language is incorporated into many GIS applications. Python also has a considerable installation base, with many freely available modules that help developers extend their software to do more.
The beginning third of the talk discusses the history and syntax of the language, along with why a GIS specialist would want to learn how to use the language. The middle of the talk discusses how Python is integrated with the ESRI ArcGIS Desktop suite. The final portion of the talk discusses two Python projects and how they can be used to extend your GIS capabilities and improve efficiency.
Recording of the talk: https://www.youtube.com/watch?v=F1_FqvbXHb4
The talk covers the Cortana Analytics Suite ecosystem, including the Azure Machine Learning predictive analytics service. The demo portion walks through sentiment analysis of social network messages.
Video of the talk and notes on the demo are available at http://0xcode.in/dev-camp
DF1 - ML - Petukhov - Azure ML: Machine Learning as a Service - MoscowDataFest
Presentation from Moscow Data Fest #1, September 12.
Moscow Data Fest is a free one-day event that brings together Data Scientists for sessions on both theory and practice.
Link: http://www.meetup.com/Moscow-Data-Fest/
Microsoft & Machine Learning / Artificial Intelligence - İbrahim KIVANÇ
This presentation covers Machine Learning and Deep Learning tools and services from Microsoft, including Azure Machine Learning Workbench, Azure Notebooks, Azure Data Science Virtual Machines, and more.
Here are the demos & resources
https://github.com/ikivanc/Azure-ML-Workbench-Iris-Dataset-Classification
https://github.com/ikivanc/Azure-ML-Resources
Join Joseph Sirosh, Corporate Vice President of the Cloud AI Platform, for a deep dive into the AI platform and exciting AI use cases. Joseph will showcase how every developer can infuse intelligence into their applications and create amazing new experiences with AI. In this exciting overview, you will learn about the application of AI technologies in the cloud. We will help you understand how to add pre-built AI capabilities like object detection, face understanding, translation and speech to applications. We will show how developers can build Cognitive Search applications that understand deep content in images, text and other data. We will also show how the platform can be used to build your own custom AI models for predictive applications and how to use the Azure platform to accelerate machine learning. Joseph will also show how companies assemble end-to-end systems of intelligence using the rich variety of data and application development services on Azure.
rosettaHUB federates the infrastructures and tools of data science within a highly interactive, responsive, and programmable framework. It provides a new generation of virtual real-time collaborative and provenance-aware workbenches, notebooks, and scientific spreadsheets, as well as tools for building Python, R, Julia, Scala, and SQL-based interactive/collaborative web applications.
Build Low-Latency Applications in Rust on ScyllaDB - ScyllaDB
Join us for a developer workshop where we’ll go hands-on to explore the affinities between Rust, the Tokio framework, and ScyllaDB.
ScyllaDB is a perfect match for Rust. Similar to the Rust programming language and the Tokio framework, ScyllaDB is built on an asynchronous, non-blocking runtime that works extremely well for building highly-reliable low-latency distributed applications.
In this workshop, you’ll go live with our sample Rust application, built on our new, high performance native Rust client driver. By compiling and walking through the code, you’ll learn specifically how to craft queries to a locally running ScyllaDB cluster.
In the process you’ll discover the features and best practices that enable your Rust applications to squeeze maximum performance out of ScyllaDB's shard-per-core architecture.
- Install and compile an IoT sample app, built on ScyllaDB’s native Rust SDK.
- Install a single cluster of Scylla locally
- Use Docker to get a 3-node cluster running on your laptop
- Connect the application to the database
- Review data modeling, query types and best practices
- Manage and monitor
If you’re an application developer with an interest in Rust and Tokio, this workshop is for you!
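A rough sketch of why shard-awareness matters for the shard-per-core architecture mentioned above: the partition key maps to a token, and the token maps to an owning shard, so a driver that computes this mapping client-side (as the ScyllaDB Rust driver does) can send each query straight to the right core. The MD5-based token function and shard count below are illustrative assumptions, not ScyllaDB's actual murmur3 partitioner.

```python
import hashlib

def token(partition_key: str) -> int:
    # Stand-in token function; real ScyllaDB uses a murmur3-based partitioner.
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

def owning_shard(partition_key: str, shard_count: int = 8) -> int:
    # Deterministic key -> shard mapping lets a driver skip a cross-core hop.
    return token(partition_key) % shard_count
```

Because the mapping is deterministic, every client computes the same shard for the same key, which is what lets a shard-aware driver bypass inter-core forwarding on the server.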
Neuron is a serverless Deep Learning and AI experiment platform for analytics where you can build, deploy, and visualise data models.
Practical lab on cloud access from anywhere.
Jupyter Notebooks and Apache Spark are first-class citizens of the data science space, a true requirement for the "modern" data scientist. Now, with Azure Synapse, these two computing powers are available to the .NET developer, and .NET is available for all data scientists. Let's look at what .NET can do for notebooks and Spark inside Azure Synapse, and at what Synapse, notebooks, and Spark are.
This is the first version of the key products we have used to offer services to our clients: about 30 tools, mostly open source, that are being used at our startup to develop minimum viable products.
Tour de France Azure PaaS 6/7: Adding intelligence - Alex Danvy
We will probably witness a generational break between apps with artificial intelligence and those without. The latter, like character-mode applications at the arrival of graphical interfaces, will struggle to survive.
Azure offers three approaches for adding AI to an app, with graduated levels of difficulty, from tools requiring no particular skill to those dedicated to data scientists.
AnalyticsConf2016 - Advanced analytics on the Azure HDInsight platform - Łukasz Grala
A session on Microsoft's Big Data Analytics solution: Hortonworks (Hadoop, HBase, Storm, Spark) together with the high-performance R Server, and advanced analytics using RevoScaleR.
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
Microsoft Technologies for Data Science 201612Mark Tabladillo
Delivered to SQL Saturday BI Edition -- Atlanta, GA
Microsoft provides several technologies in and around Azure which can be used for casual to serious data science. This presentation provides an overview of the major Microsoft options for on-premises, cloud-based, and hybrid data science. These technologies have been used by the presenter in various companies and industries, both as a Microsoft consultant and previously as an independent consultant. The speaker also provides insights into data science careers, information which helps indicate where the business will likely be for consultants and partners.
Microsoft Power BI and Cortana Analytics user group meetings with Alteryx - Håkan Söderbom
Introducing integration between Alteryx Designer and Microsoft Cortana analytics suite, including Power BI, Azure Machine Learning, SQL Server, SQL Data Warehouse, Microsoft R Server, MRS (Revolution).
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... - Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Microsoft is working hard to make Artificial Intelligence available to everyone. We not only infuse AI into our products but also give you the platform to build your very own solution, whether you are a developer, a citizen data scientist, or a hardcore data scientist.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work-from-home (“WFH”) assets, while the need for data storage keeps expanding alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... - 2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [STICD]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
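A minimal sketch of the first technique above, skipping vertices whose rank has already converged. Note this is a heuristic: a vertex marked converged is never revisited even if its in-neighbors later move, which is the approximation the optimization accepts. The three-vertex cycle graph at the end is illustrative.

```python
def pagerank_skip_converged(out_links, damping=0.85, tol=1e-6, max_iter=100):
    """PageRank that freezes a vertex once its per-iteration change drops
    below tol, skipping it in all later iterations."""
    n = len(out_links)
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    rank = {v: 1.0 / n for v in out_links}
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        for v in out_links:
            if v in converged:
                new_rank[v] = rank[v]  # skipped: no recomputation
                continue
            flow = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new_rank[v] = (1 - damping) / n + damping * flow
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:
            break
    return rank

# Symmetric 3-cycle: every vertex settles at rank 1/3 immediately.
ranks = pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]})
```

As more of the graph freezes, each iteration touches fewer vertices, which is exactly the per-iteration saving described above.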
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, the maintainers of Milvus.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be computed in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, require that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph whose vertices were split by component. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of many small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
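The levelwise scheme described in the abstract can be sketched as follows. The strongly connected components are assumed to be supplied already in topological order of the condensation (e.g. from Tarjan's algorithm), and the graph is assumed to have no dead ends; the function name is illustrative:

```python
def levelwise_pagerank(adj, components, d=0.85, tol=1e-12):
    """Process SCCs in topological order. Edges only run from earlier
    components to later ones, so each component's in-links from outside
    are already final and it can be iterated to convergence on its own.
    adj[u] lists out-neighbors of u; components is a topo-ordered SCC list."""
    n = len(adj)
    outdeg = [len(nbrs) for nbrs in adj]
    radj = [[] for _ in range(n)]              # reverse adjacency (in-links)
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            radj[v].append(u)
    r = [1.0 / n] * n
    for comp in components:
        while True:                            # local power iteration
            new = {v: (1 - d) / n + d * sum(r[u] / outdeg[u] for u in radj[v])
                   for v in comp}
            err = max(abs(new[v] - r[v]) for v in comp)
            for v in comp:
                r[v] = new[v]
            if err < tol:
                break
    return r
```

Components on the same level of the condensation share no edges, so they could be iterated concurrently; that independence is the distribution opportunity the abstract describes.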
1. Microsoft Azure ♥ R
Data Science with Microsoft Azure and R
Dmitry Petukhov,
Microsoft Data Platform MVP, C# MCP,
Big Data Enthusiast && Coffee Addicted
2. Microsoft Azure + R. Prototype to Product Challenge
Prototyping: Flexibility, Distributed, Scalable, Fault-tolerance, Reliable
Production: Flexibility, Distributed, Scalable, Fault-tolerance, Reliable, + Big Data Ready, + LSML
Migration from prototype to production: Black Magic!
3. Microsoft Azure + R. Hello R!
Python is a COOL language!
But R…
Specialized in statistical analysis
Time-effective => ideal for prototyping, competitions, research, and fun!
Standalone computing => reasonably scalable
Open source
Big bearded community
4. Microsoft Azure + R. Infrastructures for Data Scientists
Machine Learning in Finance: infrastructure stacks for a data scientist, arranged by cost of deployment/ownership from low to high (Local PC -> Hybrid Model -> Cluster, on-premises/on-demand -> ML as a Service):
Local PC: Local Disc (storage), Local OS (resource management), Python runtime / yet another runtime (execution engine), scikit-learn / some library (ML framework)
Hadoop cluster: HDFS (storage), YARN (resource management), MapReduce (execution engine), Mahout (ML framework)
Spark cluster: HDFS / S3 (storage), YARN / Apache Mesos (resource management), Spark (execution engine), MLlib (ML framework)
Python/R on Spark: HDFS / S3 (storage), YARN / Apache Mesos (resource management), Spark (execution engine), Python/R tools (ML framework)
ML as a Service: distributed FS (storage), dark magic inside, Python/R tools (ML framework)
5. Microsoft Azure + R. Microsoft ♥ R
R Server for Azure HDInsight
Data Science VM
Azure Machine Learning
Supports R-script execution
Allows authoring custom R modules
Jupyter Notebooks with R kernel support
Azure HDInsight
Hadoop/Spark-cluster as a Service
SQL Server R Services
Power BI
Running R Scripts & excellent visualization
R Tools for Visual Studio
8. R Server for Azure HDInsight
Killer features list:
100% open source R implementation;
workload running inside HDInsight (Hadoop/Spark).
Microsoft Azure + R. R Server for Azure HDInsight
9. R, Python, SQL, C#
Microsoft Azure + R. Data Science VM
Data Science VM inside:
Microsoft R Server Developer Edition,
Anaconda Python distribution,
Jupyter notebooks for Python and R,
Visual Studio Community Edition with Python and R Tools,
Power BI desktop,
SQL Server Express edition,
ML libs: CNTK, xgboost and Vowpal Wabbit,
Azure SDK
10. R Tools in Azure Machine Learning:
Supports R-script execution;
Allows authoring custom R modules;
Jupyter Notebooks with R kernel support.
Microsoft Azure + R. Azure Machine Learning
11. Microsoft Azure + R. Azure Machine Learning
[Diagram: Jupyter Notebook, Azure ML Studio, and GitHub/TFS in Azure connected by command, data, and request/response flows, exchanging the trained model h(θ0, θn).]
12. References
Cortana Intelligence and Machine Learning Blog
R for Azure Machine Learning. Quickstart
Machine Learning Algorithm Cheat Sheet
Machine Learning Hackathon. How to win?
Azure ML Repositories on GitHub
Microsoft Azure for all group on Facebook
Soon in Slack (invite form)
Microsoft Azure + R. References
14. Q&A
Now or later (send to d.petukhov@outlook.com)
Ping me
Habr: @codezombie
LinkedIn: @dpetukhov
Facebook: @code.zombi
Read my tech code instinct blog ( http://0xCode.in/ )
Microsoft Azure + R. Stay in Touch!
Editor's Notes
Revolution Analytics
Revolution R Open and Revolution R Enterprise
Revolution R is a runtime environment for the R language (a programming language for statistical data processing and graphics), optimized for multithreaded computation, together with a set of libraries for parallel processing under the "big data" paradigm.
R Server for Azure HDInsight is a 100% open source R implementation running the most comprehensive set of ML algorithms and statistical functions in the cloud, leveraging Hadoop and Spark.
By making R Server available as a workload running inside HDInsight, we remove obstacles to unlocking the power of R: memory and processing constraints are eliminated, and analytics extend from the laptop to large multi-node Hadoop and Spark clusters. This makes it possible to train and run ML models on larger datasets than previously possible, yielding more accurate predictions that affect the business. It also reduces the time to move ideas into production by eliminating time-consuming installation, setup, and hardware procurement cycles.