Data Science in the cloud with Microsoft Azure

MARTIN THORNALLEY
DATA SOLUTION ARCHITECT, MICROSOFT

Data Science Definition
“Data science is an interdisciplinary field about
processes and systems to extract knowledge or
insights from data in various forms, either structured
or unstructured, which is a continuation of some of
the data analysis fields such as statistics, machine
learning, data mining, and predictive analytics”
https://en.wikipedia.org/wiki/Data_science

Data Science Skillset
http://berkeleysciencereview.com/how-to-become-a-data-scientist-before-you-graduate/

The Cloud
Why does the Cloud matter for Data Science?
 High capacity and cost effective data storage
 Flexible, elastic compute capacity
 Ready to use technologies
 Choice of Infrastructure or Platform
 Enables Agile & DevOps
 Operational reliability and security
 Pay as you go

Microsoft Azure Cloud Platform
 Wide range of services covering Compute, Web & Mobile, Data &
Storage, Analytics, Internet of Things & Intelligence plus many more,
see http://azureplatform.azurewebsites.net/en-us/
 Easy to get started: free to try for 30 days with a limited spend, and
MSDN subscribers also get free credits, see
https://azure.microsoft.com/en-gb/free/
 Comprehensive documentation and examples
 Global presence with many recognisable brands fully committed
 Huge investment and growing rapidly

Data Science Process
https://azure.microsoft.com/en-us/documentation/articles/data-science-process-overview/

NYC taxis
 2013 NYC taxi trips and fares – open but non-trivial dataset
 24 CSV files: 12 trip files and 12 fare files, one of each per month
 ~20GB compressed, ~50GB uncompressed, 170+ million records
 medallion – vehicle identifier
 hack license – driver identifier
 passenger count
 pickup & dropoff – datetime, longitude, latitude
 trip – time and distance
 fare - payment type, fare amount, surcharge, mta tax, tip amount, tolls
amount, total amount
http://www.andresmh.com/nyctaxitrips/
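
Because trips and fares arrive in separate files, they have to be joined before modelling. A minimal sketch, assuming the join key is medallion, hack license and pickup datetime (the identifying fields listed above). The column names and the sample rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io

# Assumed join key: the identifying columns expected in both file types.
JOIN_KEY = ("medallion", "hack_license", "pickup_datetime")


def join_trip_fare(trip_file, fare_file):
    """Inner-join trip rows with fare rows on JOIN_KEY."""
    # skipinitialspace tolerates a space after each comma in the header or data.
    fares = {}
    for row in csv.DictReader(fare_file, skipinitialspace=True):
        fares[tuple(row[k] for k in JOIN_KEY)] = row

    joined = []
    for row in csv.DictReader(trip_file, skipinitialspace=True):
        fare = fares.get(tuple(row[k] for k in JOIN_KEY))
        if fare is not None:
            # Merge the two records into one dictionary.
            joined.append({**row, **fare})
    return joined


# Made-up rows for illustration only.
TRIP_CSV = """medallion,hack_license,pickup_datetime,passenger_count,trip_distance
M1,H1,2013-01-01 00:00:00,1,2.5
M2,H2,2013-01-01 00:05:00,2,1.1
"""
FARE_CSV = """medallion,hack_license,pickup_datetime,payment_type,fare_amount,tip_amount
M1,H1,2013-01-01 00:00:00,CRD,10.5,2.0
"""

rows = join_trip_fare(io.StringIO(TRIP_CSV), io.StringIO(FARE_CSV))
```

At full scale (170+ million records) this in-memory approach would not be practical; it only illustrates the shape of the join.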

Predictions
 Predict whether a specific journey will result in a tip – binary
classification
 Predict what class of tip will be for a specific journey – multiclass
classification
 Predict how much a tip will be for a specific journey – regression
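
All three predictions can be driven from the same tip amount field. A minimal sketch of how the three target values might be derived; the class boundaries in `TIP_CLASS_BOUNDS` are hypothetical and are not stated in the slides.

```python
# Hypothetical upper bounds (in dollars) for tip classes 1..3.
# Class 0 means "no tip"; class 4 means "above the last bound".
TIP_CLASS_BOUNDS = (5.0, 10.0, 20.0)


def tip_targets(tip_amount):
    """Return the binary, multiclass and regression targets for one journey."""
    tipped = 1 if tip_amount > 0 else 0

    if tip_amount <= 0:
        tip_class = 0
    else:
        # Default to the top class, then look for the first bound that fits.
        tip_class = len(TIP_CLASS_BOUNDS) + 1
        for index, bound in enumerate(TIP_CLASS_BOUNDS, start=1):
            if tip_amount <= bound:
                tip_class = index
                break

    return {
        "tipped": tipped,                 # binary classification target
        "tip_class": tip_class,           # multiclass classification target
        "tip_amount": float(tip_amount),  # regression target
    }
```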

Data Science Virtual Machine
Create Linux and Windows virtual machines in minutes
 Wide range of configurations - CPU cores, memory, disks, network
speeds
 Scale to what you need
 Pay only for what you use
 Enhance security and compliance
 Preloaded with a full set of tools and utilities from the Azure Marketplace,
e.g. SQL Server 2016 Developer edition, Azure SDK, Python, R,
Jupyter, etc.

Storage Accounts
Massively scalable cloud storage for your applications
 Security-enhanced, durable, and highly available across the globe
 Industry-leading performance with exabytes of capacity
 Pay only for what you use
 Open, multi-platform support

HDInsight
A managed Apache Hadoop, Spark, R, HBase, and Storm cloud service
made easy
 Scale to petabytes on demand
 Crunch all data: structured, semi-structured, unstructured
 Skip buying and maintaining hardware
 Spin up Apache Hadoop, Spark, and R clusters in the cloud
 Use Excel or your favourite BI tool to visualize Hadoop data
 Connect on-premises Hadoop clusters with the cloud

Azure Machine Learning
A fully managed cloud service that enables you to easily build, deploy,
and share predictive analytics solutions.
 Powerful cloud based analytics, now part of Cortana Intelligence
Suite
 Azure Machine Learning Studio includes hundreds of built-in
packages and support for custom code
 Share your solution with the world in the Gallery or on the Azure
Marketplace

Preparation & Exploration
 Copy data using AzCopy and decompress
 Inspect files and load into RStudio
 Create external Hive tables and load
 Query over full dataset for further exploration
 Remove erroneous data, e.g. implausible passenger counts or latitude/longitude values
 Engineer features using Hive
    Distance from start to finish using Haversine calculation
    Binary indicator for tips
    Tip level based on ranges for multiclass classification
 Downsample dataset and save as internal table for Machine Learning
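
The talk performs these steps in Hive. The sketch below restates three of them in Python purely for illustration: the Haversine distance, a plausibility filter and a random downsample. The thresholds in `is_plausible` and the Earth radius in miles are assumptions, not values from the slides.

```python
import math
import random

EARTH_RADIUS_MILES = 3959.0  # approximate mean Earth radius (assumed unit: miles)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_plausible(passenger_count, lat, lon):
    """Reject obviously erroneous records (hypothetical thresholds)."""
    return (1 <= passenger_count <= 6
            and 40.0 <= lat <= 42.0      # rough bounding box around NYC
            and -75.0 <= lon <= -72.0)


def downsample(rows, fraction, seed=0):
    """Keep each row with probability `fraction`; seeded for repeatability."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]
```

A pickup and a dropoff point would each be passed through `is_plausible` before computing the distance between them.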

Machine Learning & Deployment
 Import Data using Hive Query
 Build Training Experiments
 Evaluate model performance
 Create Predictive Experiments
 Publish Web Service
 Test Web Service
 Call from Excel
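
For the binary tip classifier, the "Evaluate model performance" step comes down to comparing predicted labels with actual ones. A minimal sketch of the usual metrics computed by hand; it illustrates the definitions and is not how Azure Machine Learning computes them internally.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because most journeys in a tips dataset may fall into one class, precision and recall are usually more informative than accuracy alone.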

Next Steps
To build a fully fledged enterprise solution with regular data ingestion
and model execution consider the following:
 Data Catalog
 Data Factory
 Event Hubs & Stream Analytics
 Power BI
 Cognitive Services

Summary
 Microsoft Azure provides a wide range of technologies for Data
Science activities
 Platform services reduce the management overhead
 No capacity limitations and flexible provisioning – pay as you go
 Choice of Open Source and Microsoft – use the best tool for the task
 The tools are well integrated
 Azure Machine Learning makes it trivial to deploy your models
 It’s quick and easy to get started

Getting Started
 Sign up for free
https://azure.microsoft.com/en-gb/free/
 Create a Data Science VM
https://azure.microsoft.com/en-us/marketplace/partners/microsoft-ads/standard-data-science-vm/
 Visit Cortana Intelligence Gallery
https://gallery.cortanaintelligence.com/

Thank You
Martin Thornalley
Data Solution Architect, Microsoft
@mthornal
martin.thornalley@microsoft.com
https://www.linkedin.com/in/martinthornalley
