The Developer Data Scientist
Creating New Analytics Driven Applications Using Azure Databricks® and Apache Spark™
About Richard
Richard Garris
● Principal Solutions Architect
● 14+ years in data management and advanced analytics
● Advises customers on their data science and advanced analytics projects
● Degrees from The Ohio State University and Carnegie Mellon University
Agenda
- Introduction to Data Science
- Data Science Lifecycle
- Data Ingestion
- Data Understanding & Exploration
- Modeling
- Integrating Machine Learning in Your Application
- End-to-End Example Use Cases
Introduction to Data Science
AI is Changing the World
What is the secret to AI?
AlphaGo, self-driving cars, Alexa
AI is Changing the World
What do these companies have in common?
Alphabet, Tesla, Amazon
The Hardest Part of AI Isn’t AI, It’s Big Data
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.
Components shown: ML Code, Configuration, Data Collection, Data Verification, Feature Extraction, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, Monitoring
Source: “Hidden Technical Debt in Machine Learning Systems,” Google, NIPS 2015
Business Value of Data Science
Present the Right Offer at the Right Time
• Businesses have to adapt faster to change
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
Data Science Lifecycle
Agile Modeling Process
Set Business Goals
Understand Your Data
Create Hypothesis
Devise Experiment
Prepare Data
Train-Tune-Test Model
Deploy Model
Measure / Evaluate Results
Data Scientists or Data Janitors?
“3 out of 5 data scientists spend 80% of their time collecting, cleaning and organizing data”
Data Understanding
Schema - understand the field names / data types
Metadata Management - understand descriptions and business meaning
Data Quality - data validation / profiling / checks
Exploration / Visualization - scatter plots, charts, correlations
Summary Statistics - average, min, max, range, median, standard deviation
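The summary statistics listed above can be sketched with Python's standard library alone. This is a minimal illustration, not the deck's own code: the `summarize` helper and the `readings` values are hypothetical, and on a real cluster the same figures would normally come from the DataFrame API rather than plain lists.

```python
import statistics


def summarize(values):
    """Return the basic summary statistics for a list of numbers."""
    return {
        "average": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical column of readings, used only for illustration
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = summarize(readings)

for name, value in summary.items():
    print(f"{name}: {value}")
```

A large gap between the average and the median, or a range far wider than the standard deviation suggests, is often the first hint of outliers or data quality problems.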
Data Understanding - Visualization
Data Understanding
Descriptive Statistics
Max, Min, Mean, Standard Deviation, Median, Skewness, Kurtosis
Relationship Statistics
Pearson’s correlation coefficient, Spearman’s rank correlation
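As a rough sketch of what these statistics measure, the snippet below computes skewness, excess kurtosis, Pearson correlation and Spearman rank correlation in plain Python. All function names and the sample data are assumptions made for illustration; the moment-based (population) definitions are used for skewness and kurtosis, and other conventions exist.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def skewness(xs):
    """Third standardized moment (population definition)."""
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3, so a normal distribution gives 0."""
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m4 = sum((x - m) ** 4 for x in xs) / len(xs)
    return m4 / m2 ** 2 - 3.0


def pearson(xs, ys):
    """Linear correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(xs):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Rank correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y grows monotonically with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print("skewness of x:", skewness(x))
print("excess kurtosis of x:", excess_kurtosis(x))
print("pearson:", pearson(x, y))
print("spearman:", spearman(x, y))
```

Because the relationship is monotonic but curved, the Spearman value reaches 1 while the Pearson value stays slightly below it, which is the practical difference between the two measures.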
Data Understanding
Structured
• Key Value
• Tabular (Relational, Nested)
• Graph
• Geocoded / Location
• Time Series
Unstructured
• Text (logs, tweets, articles)
• Sound / Waveform
• Sensor
• Genomic / Scientific Data
Structured Data (relational, tabular, nested)
Text (logs, tweets, articles, social)
Graph Data (connections, social)
Start with the Raw Data
Geocoded / Location Data
Time Series Data
Stock Charts
Streaming / Events
Sound and Waveform Data
Sensor Data
Images and Video
Genomic & Scientific Data
What is a Data Science Platform?
Gartner defines a Data Science Platform as:
“an end-to-end platform for developing and deploying models”
Using sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques
What is a Model?
A simplified and idealized representation of the real world
What does Modeling Mean?
A Class is a Model, Model of a Building, Data Model
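// Grade is a hypothetical placeholder type so that the example compiles
case class Grade(course: String, score: Double)

case class Employee(
  firstName: String,
  lastName: String,
  dob: java.util.Date, // date of birth
  grades: Seq[Grade]
)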
Types of Models
Machine Learning Models
Statistical Models
Financial Models
Graph Models
Simulation Models
Predictive Models
Biological Models
Two Broad Categories of Models
●Supervised learning: prediction
Classification (binary or multiclass): predict a category (label)
Regression: predict a number (target)
●Unsupervised learning: discovery
Clustering: find groupings based on pattern
Density estimation: match data with distribution pattern
Dimensionality reduction: reduce the number of columns
Similarity search: find similar data
Frequent Items (or association rules): finding relationships in variables
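To make the two categories concrete, here is a small sketch in plain Python: a least-squares line fit stands in for supervised regression (labels are given), and a one-dimensional k-means stands in for unsupervised clustering (no labels). The function names, data and starting centers are assumptions chosen for illustration; MLlib's own implementations are distributed and far more general.

```python
def fit_line(xs, ys):
    """Supervised: ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


def kmeans_1d(points, centers, iterations=10):
    """Unsupervised: Lloyd's algorithm on one-dimensional points."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Keep the previous center if a cluster ends up empty
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Supervised: every x comes with a known target y
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print("prediction for x=5:", intercept + slope * 5)

# Unsupervised: only the points are known, the groupings are discovered
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 5.0])
print("cluster centers:", centers)
```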
Model Category Use Cases
●Anomaly detection
Density estimation: “Is this observation uncommon?”
Similarity search: “How far is it from other observations?”
Clustering: “Are there groups of strange observations?”
●Lead scoring / recommendation
Classification: “Will this user become a buyer?”
Regression: “How much will he/she spend?”
Similarity search: “What products did similar users buy?”
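The question "How far is it from other observations?" can be sketched as a nearest-neighbor distance check. This is only one simple way to frame anomaly detection; the function names, the sample observations and the threshold of 2.0 are hypothetical, and the brute-force comparison shown here would not scale to large datasets.

```python
import math


def nearest_neighbor_distance(point, others):
    """Distance from a point to the closest of the other observations."""
    return min(math.dist(point, other) for other in others)


def flag_anomalies(points, threshold):
    """Return the points whose nearest neighbor is farther than the threshold."""
    flagged = []
    for i, point in enumerate(points):
        others = points[:i] + points[i + 1:]
        if nearest_neighbor_distance(point, others) > threshold:
            flagged.append(point)
    return flagged


# Hypothetical observations: four close together and one far away
observations = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]
anomalies = flag_anomalies(observations, threshold=2.0)
print(anomalies)
```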
A Model is a Mathematical Function
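One way to read this statement: a trained model is a function from a feature vector to a prediction, fixed by a handful of learned numbers. The sketch below uses the logistic function as an example; the intercept and coefficients are made-up values, not parameters from any model in this deck.

```python
import math


def logistic_model(features, intercept, coefficients):
    """Map a feature vector to a probability between 0 and 1."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical learned parameters
intercept = -1.0
coefficients = [0.5, 0.0]

low = logistic_model([2.0, 1.0], intercept, coefficients)
high = logistic_model([6.0, 1.0], intercept, coefficients)
print(low, high)
```

Training is the search for the intercept and coefficients; scoring is nothing more than evaluating the function.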
Unsupervised Methods in MLlib
Clustering
●Gaussian mixture models
●K-Means
●Streaming K-Means
●Latent Dirichlet Allocation
●Power Iteration Clustering
Frequent itemsets
●FP-growth
●PrefixSpan
Recommendation
●Alternating Least Squares
Supervised Methods in MLlib
Classification
Logistic regression w/ elastic net
Naive Bayes
Streaming logistic regression
Linear SVMs
Decision trees
Random forests
Gradient-boosted trees
Multilayer perceptron
One-vs-rest
DeepImagePredictor
Regression
Least squares w/ elastic net
Isotonic regression
Decision trees
Random forests
Gradient-boosted trees
Streaming linear methods
But What is a Model Really?
A model is a complex pipeline of components
Data Sources
Joins
Featurization Logic
Algorithm(s)
Transformers
Estimators
Tuning Parameters
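The estimator and transformer roles can be sketched in a few lines of plain Python. This is not the Spark ML API: `CategoryIndexer`, `Pipeline` and the other class names are hypothetical stand-ins that mimic the fit/transform pattern on lists of dictionaries, and the alphabetical index order is a simplification chosen for this example.

```python
class CategoryIndexer:
    """Estimator: fit() learns a string-to-index mapping and returns a transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        categories = sorted({row[self.input_col] for row in rows})
        mapping = {name: index for index, name in enumerate(categories)}
        return CategoryIndexerModel(self.input_col, self.output_col, mapping)


class CategoryIndexerModel:
    """Transformer: adds the learned index as a new column."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        return [
            {**row, self.output_col: self.mapping[row[self.input_col]]}
            for row in rows
        ]


class Pipeline:
    """Chains stages; fitting the pipeline fits each estimator in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # later stages see transformed data
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """The fitted pipeline: a sequence of transformers applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Hypothetical training rows
training = [
    {"workclass": "Private"},
    {"workclass": "State-gov"},
    {"workclass": "Private"},
]
model = Pipeline([CategoryIndexer("workclass", "workclassIdx")]).fit(training)
scored = model.transform([{"workclass": "State-gov"}])
print(scored)
```

The point of the pattern is that the fitted pipeline, not just the final algorithm, is the unit that gets evaluated, saved and deployed.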
ML Pipelines
Train model
Evaluate
Load data
Extract features
A very simple pipeline
ML Pipelines
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
A real pipeline!
Integrating Machine Learning in Your Application
Productionizing Models Today
Data Science: Develop prototype model using Python/R
Data Engineering: Re-implement model for production (Java)
Problems with Productionizing Models
Data Science: Develop prototype model using Python/R
Data Engineering: Re-implement model for production (Java)
- Extra work
- Different code paths
- Data science does not translate to production
- Slow to update models
MLlib 2.X Model Serialization
Data Science: Develop prototype model using Python/R
Persist model or Pipeline: model.save("s3n://...")
Data Engineering: Load Pipeline (Scala/Java): Model.load("s3n://…")
Deploy in production
MLlib 2.X Model Serialization Snippet
Scala
val lrModel = lrPipeline.fit(dataset)
// Save the Model
lrModel.write.save("/models/lr")
Python
lrModel = lrPipeline.fit(dataset)
# Save the Model (write() is a method call in PySpark)
lrModel.write().save("/models/lr")
Model Serialization Output
Code
// List Contents of the Model Dir
dbutils.fs.ls("/models/lr")
Output
Remember this is a pipeline model and these are the stages!
Transformer Stage (StringIndexer)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/00_strIdx_bb9728f85745/metadata/part-00000")
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/lr/stages/00_strIdx_bb9728f85745/data/"))
Output
{
"class":"org.apache.spark.ml.feature.StringIndexerModel",
"timestamp":1488120411719,
"sparkVersion":"2.1.0",
"uid":"strIdx_bb9728f85745",
"paramMap":{
"outputCol":"workclassIdx",
"inputCol":"workclass",
"handleInvalid":"error"
}
}
Metadata and params
Data (Hashmap)
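Because the stage metadata is plain JSON, it can be inspected without Spark. The sketch below parses the metadata shown above with Python's `json` module; the `metadata_text` variable simply holds a copy of that output, and in practice the text would be read from the stage's metadata file.

```python
import json

# Copy of the StringIndexer stage metadata shown above
metadata_text = """
{
  "class": "org.apache.spark.ml.feature.StringIndexerModel",
  "timestamp": 1488120411719,
  "sparkVersion": "2.1.0",
  "uid": "strIdx_bb9728f85745",
  "paramMap": {
    "outputCol": "workclassIdx",
    "inputCol": "workclass",
    "handleInvalid": "error"
  }
}
"""

metadata = json.loads(metadata_text)
stage_type = metadata["class"].rsplit(".", 1)[-1]  # drop the package prefix
params = metadata["paramMap"]

print(f"{stage_type}: {params['inputCol']} -> {params['outputCol']}")
print("trained with Spark", metadata["sparkVersion"])
```

Reading the metadata this way is a quick check of which stage class, column names and Spark version a saved pipeline depends on before loading it elsewhere.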
Estimator Stage (LogisticRegression)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/18_logreg_325fa760f925/metadata/part-00000")
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/lr/stages/18_logreg_325fa760f925/data/"))
Output
Model params
Intercept + Coefficients
{
"class":"org.apache.spark.ml.classification.LogisticRegressionModel",
"timestamp":1488120446324,
"sparkVersion":"2.1.0",
"uid":"logreg_325fa760f925",
"paramMap":{
"predictionCol":"prediction",
"standardization":true,
"probabilityCol":"probability",
"maxIter":100,
"elasticNetParam":0.0,
"family":"auto",
"regParam":0.0,
"threshold":0.5,
"fitIntercept":true,
"labelCol":"label"
}
}
Output
Decision Tree Splits
Estimator Stage (DecisionTree)
Code
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
// Re-save as JSON
sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").write.json("/models/json/dt")
Visualize Stage (DecisionTree)
Visualization of the Tree in Databricks
Databricks + ML Pipelines: Ideal Modeling Tool
Data Science - highly iterative, agile
● Lots of data sources
● Lots of dirty data
● Lots and lots of data
ML Pipelines and notebooks are an ideal way to experiment with new methods, data, and features in order to minimize error
Databricks Runtime
Elastic, Fully Managed, Highly Tuned Engine
FULLY MANAGED CLOUD SERVICE
• Auto-configured multi-user elastic clusters
• Reliable sharing with fault isolation and workload preemption
PERFORMANCE OPTIMIZATIONS
• Increases performance by 5X (TPC Benchmark)
• Connector optimizations for Cloud (Kafka, S3 and Kinesis)
COST OPTIMIZED / LINEAR SCALING
• 2x nodes - time cut in half
• 2x data, 2x nodes - time constant
• Cost of 10 nodes for 10 hours equal to 100 nodes for 1 hour
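The linear-scaling claims above reduce to simple node-hour arithmetic. The sketch below works through them under the idealized assumption that runtime is exactly proportional to work divided by node count; the function names are hypothetical, and real jobs usually fall somewhat short of perfect linearity.

```python
def node_hours(nodes, hours):
    """Billable compute, assuming cost is proportional to nodes times hours."""
    return nodes * hours


def runtime_hours(work_node_hours, nodes):
    """Runtime under the ideal linear-scaling assumption."""
    return work_node_hours / nodes


# Cost of 10 nodes for 10 hours versus 100 nodes for 1 hour
small_cluster = node_hours(10, 10)
large_cluster = node_hours(100, 1)
print(small_cluster, large_cluster)

# A job that needs 100 node-hours of work
baseline = runtime_hours(100, nodes=10)       # starting point
doubled_nodes = runtime_hours(100, nodes=20)  # 2x nodes: time cut in half
doubled_both = runtime_hours(200, nodes=20)   # 2x data, 2x nodes: time constant
print(baseline, doubled_nodes, doubled_both)
```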
DATABRICKS UNIFIED RUNTIME
Databricks I/O, Databricks Serverless
Databricks Collaborative Workspace
Frictionless Collaboration Enabling Faster Innovation
Secure collaboration for fast feedback loops with single-click access to clusters
Production Jobs
FAST, RELIABLE AND SECURE JOBS
• Executes jobs 30-50% faster
• Notebooks to Production Jobs with one-click
• Debug faster with logs and Spark history UI
DATA ENGINEER
ANALYZE DATA WITH NOTEBOOKS
• Multi-language: SQL, R, Scala, Python
• Advanced Analytics (Graph, ML & DL)
• Built-in visualization, including D3 & ggplot
DATA SCIENTIST
Interactive Notebooks
BUILD DASHBOARDS
• Publish Insights
• Real-time updates
• Interactive reports
BUSINESS SME
Dashboards
Databricks’ Approach to Accelerate Innovation
INCREASE PERFORMANCE
By more than 5x and reduce TCO by more than 70%
INCREASE PRODUCTIVITY
Of data science teams by 4-5x
STREAMLINE ANALYTIC WORKFLOWS
Reducing deployment time to minutes
REDUCE RISK
And enable innovation with out-of-the-box enterprise security and compliance
UNIFY ANALYTICS WITH APACHE SPARK
Eliminating disparate tools
[Diagram: platform overview. Personas: data engineer, data scientist / analyst, business SME. Workflow: ingest, explore, model, data product, dashboard. Data sources: data warehouses, cloud storage, Hadoop storage, IoT / streaming data. Disparate tools and tasks replaced: big data clusters (setup, optimize, break-fix), ML libraries, streaming, statistics packages, ETL, SQL. Layers: Databricks Collaborative Workspace (Databricks Interactive, Databricks Production); Databricks Runtime (Databricks I/O, Databricks Serverless, Unified Engine with SQL, Streaming, MLlib, Graph); Databricks optimizations and managed cloud service; Databricks Enterprise Security; Open APIs.]
End-to-end Examples
Demonstrations
•Predicting Power Output for an Energy Company
•Scoring inbound Leads
•Predicting Ratings given Reviews from Amazon
Demonstration
Databricks Community Edition or Free Trial
<Link to Azure>
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 

Azure Databricks for Data Scientists

  • 1. The Developer Data Scientist Creating New Analytics Driven Applications Using Azure Databricks® and Apache Spark™
  • 2. About Richard: Richard Garris ● Principal Solutions Architect ● 14+ years in data management and advanced analytics ● Advises customers on their data science and advanced analytics projects ● Degrees from The Ohio State University and Carnegie Mellon University
  • 3. Agenda - Introduction to Data Science - Data Science Lifecycle - Data Ingestion - Data Understanding & Exploration - Modeling - Integrating Machine Learning in Your Application - End-to-End Example Use Cases
  • 5. AI is Changing the World: What is the secret to AI? AlphaGo, self-driving cars, Alexa
  • 6. AI is Changing the World: What do these companies have in common? Alphabet, Tesla, Amazon
  • 7. Hardest Part of AI isn’t AI, it’s Big Data. Components of an ML system: ML Code, Configuration, Data Collection, Data Verification, Feature Extraction, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, Monitoring. “Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015. Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.
  • 8. Business Value of Data Science Present the Right Offer at the Right Time •Businesses have to Adapt Faster to Change •Data driven decisions need to be made quickly and accurately •Customers expect faster responses
  • 10. Agile Modeling Process Set Business Goals Understand Your Data Create Hypothesis Devise Experiment Prepare Data Train-Tune-Test Model Deploy Model Measure / Evaluate Results
  • 11. Data Scientists or Data Janitors? “3 out of 5 data scientists spend 80% of their time collecting, cleaning and organizing data”
  • 12. Data Understanding Schema - understand the field names / data types Metadata Management - understand descriptions and business meaning Data Quality - data validation / profiling / checks Exploration / Visualization - scatter plots, charts, correlations Summary Statistics - average, min, max, range, median, standard deviation
  • 13. Data Understanding - Visualization
  • 14. Data Understanding. Descriptive Statistics: Max, Min, Mean, Standard Deviation, Median, Skewness, Kurtosis. Relationship Statistics: Pearson’s Coefficient, Spearman correlation
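
The descriptive and relationship statistics listed on this slide can be sketched in plain Python. This is a minimal illustration using only the standard library, not the Spark or Databricks API; the function names (`skewness`, `kurtosis`, `pearson`, `ranks`, `spearman`) are hypothetical, and the population (not sample) formulas are an assumption.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed standardized values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson's coefficient computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

For a monotonic but non-linear relationship such as `ys = [x * x for x in xs]` with positive `xs`, `spearman` returns 1 while `pearson` stays below 1, which is why the two are worth inspecting together.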
  • 15. Data Understanding Structured • Key Value • Tabular (Relational, Nested) • Graph • Geocoded / Location • Time Series Unstructured • Text (logs, tweets, articles) • Sound / Waveform • Sensor • Genomic / Scientific Data
  • 16. Structured Data (relational, tabular, nested)
  • 17. Text (logs, tweets, articles, social)
  • 19. Start with the Raw Data Geocoded / Location Data
  • 20. Time Series Data Stock Charts Streaming / Events
  • 25. What is a Data Science Platform? Gartner defines a Data Science Platform as “an end-to-end platform for developing and deploying models” using sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques
  • 26. What is a Model A simplified and idealized representation of the real-world
  • 27. What does Modeling Mean? A Class is a Model; Model of a Building; Data Model: class Employee { FirstName : String; LastName : String; DOB : java.util.Date; Grades : Seq[Grade] }
  • 28. Types of Models Machine Learning Models Statistical Models Financial Models Graph Models Simulation Models Predictive Models Biological Models
  • 29. Two Broad Categories of Models ● Supervised learning: prediction. Classification (binary or multiclass): predict a category (label). Regression: predict a number (target). ● Unsupervised learning: discovery. Clustering: find groupings based on pattern. Density estimation: match data with distribution pattern. Dimensionality reduction: reduce # of columns. Similarity search: find similar data. Frequent items (or association rules): find relationships between variables
  • 30. Model Category Use Cases ●Anomaly detection Density estimation: “Is this observation uncommon?” Similarity search: “How far is it from other observations?” Clustering: “Are there groups of strange observations?” ●Lead scoring / recommendation Classification: “Will this user become a buyer?” Regression: “How much will he/she spend?” Similarity search: “What products did similar users buy?”
  • 31. A Model is a Mathematical Function
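
To make "a model is a mathematical function" concrete, here is a minimal sketch in plain Python. It assumes a linear model and a logistic model as the examples; the helper names `make_linear_model` and `make_logistic_model` are hypothetical and are not part of MLlib.

```python
import math


def make_linear_model(weights, intercept):
    """Return f(x) = intercept + sum(w_i * x_i), a regression-style model."""
    def predict(features):
        return intercept + sum(w * x for w, x in zip(weights, features))
    return predict


def make_logistic_model(weights, intercept):
    """Return f(x) = sigmoid(intercept + w . x), a probability in (0, 1)."""
    linear = make_linear_model(weights, intercept)

    def predict_probability(features):
        return 1.0 / (1.0 + math.exp(-linear(features)))
    return predict_probability
```

Training is then the search for the `weights` and `intercept` that minimize error on known examples; once they are fixed, scoring a new record is just a function call.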
  • 32. Unsupervised Methods in MLlib Clustering ●Gaussian mixture models ●K-Means ●Streaming K-Means ●Latent Dirichlet Allocation ●Power Iteration Clustering Frequent itemsets ●FP-growth ●Prefix span Recommendation ●Alternating Least Squares
  • 33. Supervised Methods in MLlib Classification Logistic regression w/ elastic net Naive Bayes Streaming logistic regression Linear SVMs Decision trees Random forests Gradient-boosted trees Multilayer perceptron One-vs-rest DeepImagePredictor Regression Least squares w/ elastic net Isotonic regression Decision trees Random forests Gradient-boosted trees Streaming linear methods
  • 34. But What is a Model Really? A model is a complex pipeline of components Data Sources Joins Featurization Logic Algorithm(s) Transformers Estimators Tuning Parameters
  • 35. ML Pipelines Train model Evaluate Load data Extract features A very simple pipeline
  • 36. ML Pipelines: Datasource 1, Datasource 2, Datasource 3; Extract features, Extract features; Feature transform 1, Feature transform 2, Feature transform 3; Train model 1, Train model 2; Ensemble; Evaluate. A real pipeline!
  • 37. Integrating Machine Learning in Your Application
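
The transformer / estimator / pipeline idea from the slides above can be sketched without Spark. The classes below are a hypothetical plain-Python illustration of the pattern (an estimator's `fit` returns a fitted transformer, and a pipeline chains stages), operating on lists of dicts instead of DataFrames; they are not the `pyspark.ml` classes.

```python
class LowerCase:
    """Stateless transformer: lower-cases a string column."""

    def __init__(self, col):
        self.col = col

    def transform(self, rows):
        return [{**r, self.col: r[self.col].lower()} for r in rows]


class MinMaxScaler:
    """Estimator: learns the min and max of a column during fit()."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = [r[self.input_col] for r in rows]
        return MinMaxScalerModel(self.input_col, self.output_col,
                                 min(values), max(values))


class MinMaxScalerModel:
    """Fitted transformer produced by MinMaxScaler.fit()."""

    def __init__(self, input_col, output_col, lo, hi):
        self.input_col, self.output_col = input_col, output_col
        self.lo, self.hi = lo, hi

    def transform(self, rows):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        return [{**r, self.output_col: (r[self.input_col] - self.lo) / span}
                for r in rows]


class Pipeline:
    """Fits each estimator in order, feeding it the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows
```

Because the fitted `PipelineModel` carries every stage, new data goes through exactly the same featurization as the training data, which is the property the following slides rely on when moving a model to production.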
  • 38. Productionizing Models Today Data Science Data Engineering Develop Prototype Model using Python/R Re-implement model for production (Java)
  • 39. Problems with Productionizing Models Develop Prototype Model using Python/R Re-implement model for production (Java) - Extra work - Different code paths - Data science does not translate to production - Slow to update models Data Science Data Engineering
  • 40. MLlib 2.x Model Serialization. Data Science: Develop Prototype Model using Python/R; Persist model or Pipeline: model.save("s3n://..."). Data Engineering: Load Pipeline (Scala/Java): Model.load("s3n://…"); Deploy in production
  • 41. MLlib 2.x Model Serialization Snippet
      Scala:
        val lrModel = lrPipeline.fit(dataset)
        // Save the Model
        lrModel.write.save("/models/lr")
      Python:
        lrModel = lrPipeline.fit(dataset)
        # Save the Model
        lrModel.write.save("/models/lr")
  • 42. Model Serialization Output
      Code:
        // List Contents of the Model Dir
        dbutils.fs.ls("/models/lr")
      Output: Remember this is a pipeline model and these are the stages!
  • 43. Transformer Stage (StringIndexer)
      Code:
        // Cat the contents of the Metadata dir
        dbutils.fs.head("/models/lr/stages/00_strIdx_bb9728f85745/metadata/part-00000")
        // Display the Parquet File in the Data dir
        display(sqlContext.read.parquet("/models/lr/stages/00_strIdx_bb9728f85745/data/"))
      Output (metadata and params):
        { "class":"org.apache.spark.ml.feature.StringIndexerModel", "timestamp":1488120411719, "sparkVersion":"2.1.0", "uid":"strIdx_bb9728f85745", "paramMap":{ "outputCol":"workclassIdx", "inputCol":"workclass", "handleInvalid":"error" } }
      Output (data): Hashmap
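
What the stored hashmap represents can be sketched in plain Python. This is an illustrative stand-in for a string-indexing stage, not Spark's implementation: labels are ordered by descending frequency so the most common label gets index 0, and with `handle_invalid="error"` an unseen label raises. The alphabetical tie-break and the function names are assumptions made for determinism in this sketch.

```python
from collections import Counter


def fit_string_indexer(values):
    """Build a label -> index map, most frequent label first."""
    counts = Counter(values)
    # Assumption: ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


def index_value(mapping, value, handle_invalid="error"):
    """Look up a label; behaviour for unseen labels depends on handle_invalid."""
    if value in mapping:
        return mapping[value]
    if handle_invalid == "skip":
        return None  # caller drops the row
    raise ValueError(f"Unseen label: {value!r}")
```

For example, fitting on a `workclass` column where `"Private"` is the most common value maps `"Private"` to `0.0`; scoring a record with a value never seen during training fails fast instead of silently producing a wrong index.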
  • 44. Estimator Stage (LogisticRegression)
      Code:
        // Cat the contents of the Metadata dir
        dbutils.fs.head("/models/lr/stages/18_logreg_325fa760f925/metadata/part-00000")
        // Display the Parquet File in the Data dir
        display(sqlContext.read.parquet("/models/lr/stages/18_logreg_325fa760f925/data/"))
      Output (model params):
        {"class":"org.apache.spark.ml.classification.LogisticRegressionModel", "timestamp":1488120446324, "sparkVersion":"2.1.0", "uid":"logreg_325fa760f925", "paramMap":{ "predictionCol":"prediction", "standardization":true, "probabilityCol":"probability", "maxIter":100, "elasticNetParam":0.0, "family":"auto", "regParam":0.0, "threshold":0.5, "fitIntercept":true, "labelCol":"label" }}
      Output (data): Intercept + Coefficients
  • 45. Estimator Stage (DecisionTree)
      Code:
        // Display the Parquet File in the Data dir
        display(sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
        // Re-save as JSON
        sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").write.json("/models/json/dt")
      Output: Decision Tree Splits
  • 46. Visualize Stage (DecisionTree) Visualization of the Tree In Databricks
  • 47. Databricks + ML Pipelines: Ideal Modeling Tool. Data Science is highly iterative and agile: ● Lots of data sources ● Lots of dirty data ● Lots and lots of data. ML Pipelines and notebooks are an ideal way to experiment with new methods, data, and features in order to minimize error
  • 48. Databricks Runtime: Elastic, Fully Managed, Highly Tuned Engine. FULLY MANAGED CLOUD SERVICE • Auto-configured multi-user elastic clusters • Reliable sharing with fault isolation and workload preemption. PERFORMANCE OPTIMIZATIONS • Increases performance by 5X (TPC Benchmark) • Connector optimizations for Cloud (Kafka, S3 and Kinesis). COST OPTIMIZED / LINEAR SCALING • 2x nodes - time cut in half • 2x data, 2x nodes - time constant • Cost of 10 nodes for 10 hours equal to 100 nodes for 1 hour. DATABRICKS UNIFIED RUNTIME: Databricks I/O, Databricks Serverless
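
The linear-scaling bullets on this slide are simple arithmetic, sketched below under the idealized assumption that runtime is exactly proportional to data volume and inversely proportional to node count (real clusters have overheads this ignores). The function names are hypothetical.

```python
def runtime_hours(base_hours, base_nodes, nodes, data_factor=1.0):
    """Runtime under ideal linear scaling relative to a measured baseline."""
    return base_hours * data_factor * base_nodes / nodes


def node_hours(nodes, hours):
    """Billable compute, assuming cost is proportional to nodes x hours."""
    return nodes * hours
```

With a baseline of 10 nodes for 10 hours, doubling the nodes halves the time, doubling both data and nodes leaves it unchanged, and 100 nodes finish in 1 hour for the same 100 node-hours.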
  • 49. Databricks Collaborative Workspace: Frictionless Collaboration Enabling Faster Innovation. Secure collaboration for fast feedback loops with single-click access to clusters. DATA ENGINEER, Production Jobs: FAST, RELIABLE AND SECURE JOBS • Executes jobs 30-50% faster • Notebooks to Production Jobs with one click • Debug faster with logs and Spark history UI. DATA SCIENTIST, Interactive Notebooks: ANALYZE DATA WITH NOTEBOOKS • Multi-language: SQL, R, Scala, Python • Advanced Analytics (Graph, ML & DL) • Built-in visualization, including D3 & ggplot. BUSINESS SME, Dashboards: BUILD DASHBOARDS • Publish Insights • Real-time updates • Interactive reports
  • 50. Databricks’ Approach to Accelerate Innovation. INCREASE PERFORMANCE by more than 5x and reduce TCO by more than 70%. INCREASE PRODUCTIVITY of data science teams by 4-5x. STREAMLINE ANALYTIC WORKFLOWS, reducing deployment time to minutes. REDUCE RISK and enable innovation with out-of-the-box enterprise security and compliance. UNIFY ANALYTICS WITH APACHE SPARK, eliminating disparate tools. [Diagram labels] Users: DATA SCIENTIST / ANALYST, BUSINESS SME, DATA ENGINEER. Sources: DATA WAREHOUSES, CLOUD STORAGE, HADOOP STORAGE, IoT / STREAMING DATA. Workflow: INGEST, EXPLORE, MODEL, DATA PRODUCT, DASHBOARD. Disparate tools: BIG DATA CLUSTERS (SETUP, OPTIMIZE, BREAK-FIX), ML LIBRARIES, STREAMING, STATISTICS PACKAGES, ETL, SQL. Platform: DATABRICKS ENTERPRISE SECURITY; DATABRICKS COLLABORATIVE WORKSPACE (Databricks Production, Databricks Interactive); DATABRICKS RUNTIME (Databricks I/O, Databricks Serverless); Unified Engine (SQL, Streaming, MLlib, Graph) with Open APIs; Databricks Optimizations and Managed Cloud Service
  • 52. Demonstrations • Predicting Power Output for an Energy Company • Scoring Inbound Leads • Predicting Ratings given Reviews from Amazon
  • 53. Demonstration: Databricks Community Edition or Free Trial <Link to Azure> Additional Questions? Contact us at http://go.databricks.com/contact-databricks