SlideShare a Scribd company logo
1 of 53
01 | Data Analytic Thinking
Cynthia Rudin | MIT Sloan School of Management
Stephen F Elston | Principle Consultant, Quantia Analytics, LLC
Data Analytic Thinking: Make decisions based on analysis of
data.
• Replace intuition with data driven analytical decisions
• Extract value from data assets
• Increase pace of actions
Data Analytic Thinking
Medical treatment:
• Formerly waited for patient to show symptoms
• Classify patients with high risk; take preventative action
• Similar to predictive maintenance applications
Data Analytic Example
Detect risking payment accounts:
• Formally performed manual retrospective analysis
• Real-time decision on account
• Classification widely used for anomaly detection
Data Analytic Example
Bicycle Rental Demand:
• Under stocking or over-stocking costly
• Manual forecasting difficult
• Forecast demand to optimize inventory
• Forecasting widely used
Data Analytic Example
Data Analytic Thinking:
• Replace intuition with data driven analytical decisions
• Transform raw data to valuable asset
• Increase pace of action
Data Analytic Thinking
01 | Overview of The Data Science Process
Cynthia Rudin | MIT Sloan School of Management
• Historical Notes on KDD, CRISP-DM, Big Data and Data Science
and their relationship to Data Mining and Machine Learning
• Example of the knowledge discovery process
Module Overview
Historical Notes on KDD, CRISP-DM, Big Data and
Data Science and their relationship to Data Mining
and Machine Learning
Historical Notes
• Term “Big Data” coined by astronomers Cox and Ellsworth in
1997
*From the Computing Community Consortium Big Data Whitepaper: http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
CCC Big Data Pipeline from 2012*
Acquisition
/
Recording
Extraction
/
Cleaning
/
Annotation
Integration
/
Aggregation
/
Representation
Analysis
/
Modeling
Interpretation
Heterogeneity
Scale
Timeliness
Privacy
Human
Collaboration
Overall System
Historical Notes
• KDD (Knowledge Discovery in Databases) Process
Selection
Preprocessing
Transformation
Data Mining
Interpretation
/ Evaluation
Data
Target Data
Preprocessed
Data
Transformed
Data
Patterns
Information
Based on content in “From Data Mining to Knowledge Discovery”, AI Magazine, Vol 17, No. 3 (1996)
http://www.aaai.org/ojs/index.php/aimagazine/article/view/1230
Historical Notes
• KDD (Knowledge Discovery in Databases) Process
Selection
Preprocessing
Transformation
Data Mining
Interpretation
/ Evaluation
Data
Target Data
Preprocessed
Data
Transformed
Data
Patterns
Information
Based on content in “From Data Mining to Knowledge Discovery”, AI Magazine, Vol 17, No. 3 (1996)
http://www.aaai.org/ojs/index.php/aimagazine/article/view/1230
*CCC had no
citations to KDD
1996
CCC 2012
KDD 1996
Historical Notes
• KDD (Knowledge Discovery in Databases) Process
Selection
Preprocessing
Transformation
Data Mining
Interpretation
/ Evaluation
Data
Target Data
Preprocessed
Data
Transformed
Data
Patterns
Information
Based on content in “From Data Mining to Knowledge Discovery”, AI Magazine, Vol 17, No. 3 (1996)
http://www.aaai.org/ojs/index.php/aimagazine/article/view/1230
Historical Notes
CRoss Industry Standard Process for Data Mining
(CRISP-DM)
Business
Understanding
Data
Understanding
Data
Preparation
Modeling Evaluation Deployment
Identify project
objectives
Collect and review
data
Select and cleanse
data
Manipulate data and
draw conclusions
Evaluate model
and conclusions
Apply conclusions
to business
From 2000, 77 pages
Historical Notes
• The stages are basically the same no matter who
invents or reinvents the (knowledge discovery / data
mining / big data / data science) process. You may
not always need all the stages.
• Data science is an iterative process.
– Backwards arrows on most process diagrams.
Example of the Knowledge Discovery
Process
Knowledge Discovery Process Example
• I’ll walk you through the knowledge discovery
process with an example – the process of predicting
power failures in Manhattan.
Motivation for Example
• In NYC the peak demand for electricity is
rising.
• The infrastructure dates back to the 1880’s
from the time of Thomas Edison.
• Power failures occur fairly often (enough to
do statistics) and are expensive to repair
• We want to determine how to prioritize
manhole inspections in order to reduce the
number of manhole events (fires,
explosions, outages) in the future.
• This is a real problem.
• Opportunity Assessment & Business Understanding
• Data Understanding & Data Acquisition
• Data Preparation, including Cleaning and Transformation
• Model Building
• Policy Construction
• Evaluation, Residuals and Metrics
• Model Deployment, Monitoring, Model Updates
Stages in the knowledge discovery process
Opportunity Assessment & Business Understanding
What do you really want to accomplish and what are the constraints? What
are the risks? How will you evaluate the quality of the results?
• For manhole events the general goal was to “predict manhole fires and
explosions before they occur.” We made it more precise:
– Goal 1: Assess predictive accuracy for predicting manhole events in the
year before they happen.
– Goal 2: Create a cost-benefit analysis for inspection policies that takes
into account the cost of inspections and manhole fires. Determine how
often manholes need to be inspected.
Data Understanding & Data Acquisition
Data were:
–Trouble tickets – free text documents typed by dispatchers
documenting problems on the electrical grid.
–Records of information about manholes
–Records of information about underground cables
–Electrical shock information tables
–Extra information about serious events
–Inspection reports
–Vented cover data
V’s of Big Data include
“Variety” and “Veracity”
How do you know
what data to trust?
Data Cleaning and Transformation
• Sometimes 99% of the work
Data Cleaning and Transformation
• Turn free text into structured information:
– Trouble tickets turned into a vector like:
• Serious / Less Serious / Not an Event
• Year
• Month
• Day
• Manholes involved
• …
• Try to integrate tables (create unique identifiers):
– If you join manholes to cables, half of the cable records disappear
Knowledge Discovery Process
Model Building
• Often predictive modeling, meaning machine learning or statistical
modeling
• If you want to answer a yes/no question, this is classification.
– For manholes, will the manhole explode next year? Y/N
• If you want to predict a numerical value, this is regression.
• If you want to group observations into similar-looking groups, this is
clustering.
• If you want to recommend someone an item (e.g., book/movie/product)
based on ratings data from customers, this is a recommender system.
• Note: There are many other machine learning problems.
Policy Construction
• How will your model be used to change policy?
– For manholes, how should we recommend changing the inspection
policy based on our model?
– Consider using social media and customer purchase data to determine
customer participation if Starbucks moves into New City. After the model
is created, how to optimize where the shops are located, how big they
are, and where the warehouses are located.
• Model building is predictive, Policy Construction is prescriptive.
Evaluation
• How do you measure the quality of the result? Evaluation can
be difficult if the data do not provide ground truth.
• For manhole events, we had engineers at Con Edison withhold
high quality recent data and conduct a blind test.
Deployment
• Getting a working proof of concept deployed stops 95%
percent of projects.
• Don’t bother doing the project in the first place if no one plans
to deploy it.*
• Keep a realistic timeline in mind. Then add several months.
• While the model is deployed it will need to be updated and
improved.
* Unless it’s fun.
Knowledge Discovery is an Iterative Process
Summary
• Several attempts to make the process of discovering knowledge
scientific
– KDD, CRISP-DM,CCC Big Data Pipeline
• All have very similar steps
– Data Mining is only one of those steps (but an important one)
Stephen F Elston| Principle Consultant, Quantia Analytics, LLC
• Overview of data science technology
• Tour of Azure ML Studio
• Azure ML Experiments
• Azure tables and data types
• R and Python for Azure ML
• SQL for Azure ML
• Machine learning example
Outline
Tools for Data Science
• Open source tools: R and Python
• Azure Machine Learning
– Easy to use
– Easy to deploy
• Data science stacks
Why Open-Source Tools?
• R and Python widely used in data science
• Highly interactive
• Good visualization
• Vast packages (libraries) of utilities and algorithms
• Excellent development environments
• Jupyter notebooks
• R and Python are widely used in data science
• Powerful open-source data science tools
• Both have advantages and disadvantages
• Python tends to be more systematic and faster
• R contains wider range of packages and capabilities
• R offers grammar of graphics
R or Python?
Cortana Intelligence Suite
Intelligence
Dashboards &
Visualizations
Information
Management
Big Data Stores Machine Learning
and Analytics
Cortana
Event Hubs
HDInsight
(Hadoop and
Spark)
Stream
Analytics
Data Intelligence Action
People
Automated
Systems
Apps
Web
Mobile
Bots
Bot
Framework
SQL Data
Warehouse
Data Catalog
Data Lake
Analytics
Data Factory
Machine
Learning
Data Lake Store
Cognitive
Services
Power BI
Data
Sources
Apps
Sensors
and
devices
Data
Why Azure ML?
• Easy to use
• Quickly deploy production solutions as
web services
• Models run in a highly scalable cloud environment
• Secure cloud environment for data and code
• Powerful, efficient built-in algorithms
• Extensible with, SQL, Python, and R
• Integration with Jupyter notebooks
Azure ML Free Tier Account
• Free Tier Account
• Unlimited time, with restricted priority
• Paid account provides full performance
• Experiments contain workflow
• Experiments constructed of modules
• Experiments in sharable workspace
• Modules transform data, compute models, score models, and
evaluate models
• Create custom modules with SQL, R and Python
• Deploy solutions as web services
Azure ML Studio
Data Passed from Module to Module
in Azure ML Tables
Module
Module
Col1, Col2, Col3, ………………………,ColN
Val11, Val12, Val13, ……………………,Val1N
ValM1, ValM2, ValM3, ………………, ValMN
………., ……….., ………., ……………………., ………
Rectangular table
N Columns
M Rows
Equal length columns
Azure ML Table Data Types
• Numeric; Floating Point
• Numeric: Integer
• Boolean
• String
• Categorical
• Date-time
• Time-Span
• Image
• Azure ML is a production environment
• Interactively develop and test in IDE
• Subset data as needed – download as .csv
• IDE has powerful editor and debugger
• Cut and paste code into Execute R/Python Script module to
test in Azure ML
• Jupyter notebooks
Developing and testing R and Python
Execute R Script
Azure ML Table R Device Port
myFrame <- maml.mapInputPort(1,2)
zip file
source("src/myScript.R")
maml.mapOutputPort(“myFrame")
print(“Hello world")
Execute Python Script
Azure ML Table Python Device Port
def azureml_main(inFrame1, inFrame2)
zip file
import my_package
return myFrame
print(“Hello world")
• Code tested in IDE should run in Azure ML, but…….
• If error occurs look at the error.log or output.log
• From R use print() function
• From Python use sys.stderr.write() from sys
Debugging R and Python in Azure ML
SQL in Azure ML
• SQL data I/O; Reader and Writer modules
• SQL transformation module
• SQL resources
Querying with Transact-SQL, Graeme Malcom and Geoff Allix,
https://www.edx.org/course/querying-transact-sql-microsoft-
dat201x-0
Formally, given training set (xi,yi) for i=1…n, we want to create a classification model
f that can predict label y for a new feature values x.
Classification
Feature 1
Feature
2
0
f(x)=0 f(x)<0
f(x)>0
f(x) = function(features)
L1
L2
Machine learning workflow
Input data
Data Transformation
Train Model
Define Model Split Data
Score (prediction)
Evaluate Model
©2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered trademarks and/or trademarks in the
U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft
must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after
the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

More Related Content

Similar to Data Analytic Thinking: Replace intuition with data driven decisions

Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxsumitkumar600840
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesKimberley Mitchell
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedDunn Solutions Group
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Robert Williams
 
Introduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfIntroduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfAbdulrahimShaibuIssa
 
A Survey on Big Data Analytics
A Survey on Big Data AnalyticsA Survey on Big Data Analytics
A Survey on Big Data AnalyticsBHARATH KUMAR
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptxinfinix8
 
Tips and Tricks to be an Effective Data Scientist
Tips and Tricks to be an Effective Data ScientistTips and Tricks to be an Effective Data Scientist
Tips and Tricks to be an Effective Data ScientistLisa Cohen
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AIGary Allemann
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data sciencebhavesh lande
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptxKannanThangavelu2
 
Data Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationData Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationSunderland City Council
 
Data science and business analytics
Data  science and business analyticsData  science and business analytics
Data science and business analyticsInbavalli Valli
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big DataInfochimps, a CSC Big Data Business
 

Similar to Data Analytic Thinking: Replace intuition with data driven decisions (20)

Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 
Data mining applications
Data mining applicationsData mining applications
Data mining applications
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use Cases
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They Need
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
 
Introduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfIntroduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdf
 
Data Science in Python.pptx
Data Science in Python.pptxData Science in Python.pptx
Data Science in Python.pptx
 
A Survey on Big Data Analytics
A Survey on Big Data AnalyticsA Survey on Big Data Analytics
A Survey on Big Data Analytics
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptx
 
Tips and Tricks to be an Effective Data Scientist
Tips and Tricks to be an Effective Data ScientistTips and Tricks to be an Effective Data Scientist
Tips and Tricks to be an Effective Data Scientist
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AI
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Lesson1.2.pptx.pdf
Lesson1.2.pptx.pdfLesson1.2.pptx.pdf
Lesson1.2.pptx.pdf
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptx
 
Data Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data VisualisationData Warehousing, Data Mining & Data Visualisation
Data Warehousing, Data Mining & Data Visualisation
 
Data science and business analytics
Data  science and business analyticsData  science and business analytics
Data science and business analytics
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 

More from XanGwaps

MSITSytemDesignAndDesignasdsdsdsdsdsds.pptx
MSITSytemDesignAndDesignasdsdsdsdsdsds.pptxMSITSytemDesignAndDesignasdsdsdsdsdsds.pptx
MSITSytemDesignAndDesignasdsdsdsdsdsds.pptxXanGwaps
 
AdvanceDatabaseChapter6Advance Dtabases.pptx
AdvanceDatabaseChapter6Advance Dtabases.pptxAdvanceDatabaseChapter6Advance Dtabases.pptx
AdvanceDatabaseChapter6Advance Dtabases.pptxXanGwaps
 
C++ Inheritance.pptx
C++ Inheritance.pptxC++ Inheritance.pptx
C++ Inheritance.pptxXanGwaps
 
Chapter-OBDD.pptx
Chapter-OBDD.pptxChapter-OBDD.pptx
Chapter-OBDD.pptxXanGwaps
 
virtualization-220403085202_Chapter1.pptx
virtualization-220403085202_Chapter1.pptxvirtualization-220403085202_Chapter1.pptx
virtualization-220403085202_Chapter1.pptxXanGwaps
 
Java ConstructorsPPT.pptx
Java ConstructorsPPT.pptxJava ConstructorsPPT.pptx
Java ConstructorsPPT.pptxXanGwaps
 
AdvanceSQL.ppt
AdvanceSQL.pptAdvanceSQL.ppt
AdvanceSQL.pptXanGwaps
 
Object-Oriented Systems Analysis and Design Using UML.pptx
Object-Oriented Systems Analysis and Design Using UML.pptxObject-Oriented Systems Analysis and Design Using UML.pptx
Object-Oriented Systems Analysis and Design Using UML.pptxXanGwaps
 
mod5_cabling-LANsWANs-2.ppt
mod5_cabling-LANsWANs-2.pptmod5_cabling-LANsWANs-2.ppt
mod5_cabling-LANsWANs-2.pptXanGwaps
 
globodox-presentation-v14.ppsx
globodox-presentation-v14.ppsxglobodox-presentation-v14.ppsx
globodox-presentation-v14.ppsxXanGwaps
 
chapter-2.ppt
chapter-2.pptchapter-2.ppt
chapter-2.pptXanGwaps
 
Requirements Analysis.pptx
Requirements Analysis.pptxRequirements Analysis.pptx
Requirements Analysis.pptxXanGwaps
 
Java Constructors.pptx
Java Constructors.pptxJava Constructors.pptx
Java Constructors.pptxXanGwaps
 
Java Constructors.pptx
Java Constructors.pptxJava Constructors.pptx
Java Constructors.pptxXanGwaps
 
CS530-Tuesday01.ppt
CS530-Tuesday01.pptCS530-Tuesday01.ppt
CS530-Tuesday01.pptXanGwaps
 
Object-Oriented Systems Analysis and Design Using UML.pptx
Object-Oriented Systems Analysis and Design Using UML.pptxObject-Oriented Systems Analysis and Design Using UML.pptx
Object-Oriented Systems Analysis and Design Using UML.pptxXanGwaps
 
Functions of Operating System.pptx
Functions of Operating System.pptxFunctions of Operating System.pptx
Functions of Operating System.pptxXanGwaps
 
HCI 1 Module 2.pptx
HCI 1 Module 2.pptxHCI 1 Module 2.pptx
HCI 1 Module 2.pptxXanGwaps
 
The virtual box.pptx
The virtual box.pptxThe virtual box.pptx
The virtual box.pptxXanGwaps
 
Presentation (10).pptx
Presentation (10).pptxPresentation (10).pptx
Presentation (10).pptxXanGwaps
 

More from XanGwaps (20)

MSITSytemDesignAndDesignasdsdsdsdsdsds.pptx
MSITSytemDesignAndDesignasdsdsdsdsdsds.pptxMSITSytemDesignAndDesignasdsdsdsdsdsds.pptx
MSITSytemDesignAndDesignasdsdsdsdsdsds.pptx
 
AdvanceDatabaseChapter6Advance Dtabases.pptx
AdvanceDatabaseChapter6Advance Dtabases.pptxAdvanceDatabaseChapter6Advance Dtabases.pptx
AdvanceDatabaseChapter6Advance Dtabases.pptx
 
C++ Inheritance.pptx
C++ Inheritance.pptxC++ Inheritance.pptx
C++ Inheritance.pptx
 
Chapter-OBDD.pptx
Chapter-OBDD.pptxChapter-OBDD.pptx
Chapter-OBDD.pptx
 
virtualization-220403085202_Chapter1.pptx
virtualization-220403085202_Chapter1.pptxvirtualization-220403085202_Chapter1.pptx
virtualization-220403085202_Chapter1.pptx
 
Java ConstructorsPPT.pptx
Java ConstructorsPPT.pptxJava ConstructorsPPT.pptx
Java ConstructorsPPT.pptx
 
AdvanceSQL.ppt
AdvanceSQL.pptAdvanceSQL.ppt
AdvanceSQL.ppt
 
Object-Oriented Systems Analysis and Design Using UML.pptx
Object-Oriented Systems Analysis and Design Using UML.pptxObject-Oriented Systems Analysis and Design Using UML.pptx
Object-Oriented Systems Analysis and Design Using UML.pptx
 
mod5_cabling-LANsWANs-2.ppt
mod5_cabling-LANsWANs-2.pptmod5_cabling-LANsWANs-2.ppt
mod5_cabling-LANsWANs-2.ppt
 
globodox-presentation-v14.ppsx
globodox-presentation-v14.ppsxglobodox-presentation-v14.ppsx
globodox-presentation-v14.ppsx
 
chapter-2.ppt
chapter-2.pptchapter-2.ppt
chapter-2.ppt
 
Requirements Analysis.pptx
Requirements Analysis.pptxRequirements Analysis.pptx
Requirements Analysis.pptx
 
Java Constructors.pptx
Java Constructors.pptxJava Constructors.pptx
Java Constructors.pptx
 
Java Constructors.pptx
Java Constructors.pptxJava Constructors.pptx
Java Constructors.pptx
 
CS530-Tuesday01.ppt
CS530-Tuesday01.pptCS530-Tuesday01.ppt
CS530-Tuesday01.ppt
 
Object-Oriented Systems Analysis and Design Using UML.pptx
Object-Oriented Systems Analysis and Design Using UML.pptxObject-Oriented Systems Analysis and Design Using UML.pptx
Object-Oriented Systems Analysis and Design Using UML.pptx
 
Functions of Operating System.pptx
Functions of Operating System.pptxFunctions of Operating System.pptx
Functions of Operating System.pptx
 
HCI 1 Module 2.pptx
HCI 1 Module 2.pptxHCI 1 Module 2.pptx
HCI 1 Module 2.pptx
 
The virtual box.pptx
The virtual box.pptxThe virtual box.pptx
The virtual box.pptx
 
Presentation (10).pptx
Presentation (10).pptxPresentation (10).pptx
Presentation (10).pptx
 

Recently uploaded

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 

Recently uploaded (20)

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 

Data Analytic Thinking: Replace intuition with data driven decisions

  • 1. 01 | Data Analytic Thinking Cynthia Rudin | MIT Sloan School of Management Stephen F Elston | Principle Consultant, Quantia Analytics, LLC
  • 2. Data Analytic Thinking: Make decisions based on analysis of data. • Replace intuition with data driven analytical decisions • Extract value from data assets • Increase pace of actions Data Analytic Thinking
  • 3. Medical treatment: • Formerly waited for patient to show symptoms • Classify patients with high risk; take preventative action • Similar to predictive maintenance applications Data Analytic Example
  • 4. Detect risking payment accounts: • Formally performed manual retrospective analysis • Real-time decision on account • Classification widely used for anomaly detection Data Analytic Example
  • 5. Bicycle Rental Demand: • Under stocking or over-stocking costly • Manual forecasting difficult • Forecast demand to optimize inventory • Forecasting widely used Data Analytic Example
  • 6. Data Analytic Thinking: • Replace intuition with data driven analytical decisions • Transform raw data to valuable asset • Increase pace of action Data Analytic Thinking
  • 7. 01 | Overview of The Data Science Process Cynthia Rudin | MIT Sloan School of Management
  • 8. • Historical Notes on KDD, CRISP-DM, Big Data and Data Science and their relationship to Data Mining and Machine Learning • Example of the knowledge discovery process Module Overview
  • 9. Historical Notes on KDD, CRISP-DM, Big Data and Data Science and their relationship to Data Mining and Machine Learning
  • 10. Historical Notes • Term “Big Data” coined by astronomers Cox and Ellsworth in 1997 *From the Computing Community Consortium Big Data Whitepaper: http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf CCC Big Data Pipeline from 2012* Acquisition / Recording Extraction / Cleaning / Annotation Integration / Aggregation / Representation Analysis / Modeling Interpretation Heterogeneity Scale Timeliness Privacy Human Collaboration Overall System
  • 11. Historical Notes • KDD (Knowledge Discovery in Databases) Process Selection Preprocessing Transformation Data Mining Interpretation / Evaluation Data Target Data Preprocessed Data Transformed Data Patterns Information Based on content in “From Data Mining to Knowledge Discovery”, AI Magazine, Vol 17, No. 3 (1996) http://www.aaai.org/ojs/index.php/aimagazine/article/view/1230
  • 12. Historical Notes • KDD (Knowledge Discovery in Databases) Process Selection Preprocessing Transformation Data Mining Interpretation / Evaluation Data Target Data Preprocessed Data Transformed Data Patterns Information Based on content in “From Data Mining to Knowledge Discovery”, AI Magazine, Vol 17, No. 3 (1996) http://www.aaai.org/ojs/index.php/aimagazine/article/view/1230 *CCC had no citations to KDD 1996
  • 14. Historical Notes • KDD (Knowledge Discovery in Databases) Process Selection Preprocessing Transformation Data Mining Interpretation / Evaluation Data Target Data Preprocessed Data Transformed Data Patterns Information Based on content in “From Data Mining to Knowledge Discovery”, AI Magazine, Vol 17, No. 3 (1996) http://www.aaai.org/ojs/index.php/aimagazine/article/view/1230
  • 15. Historical Notes CRoss Industry Standard Process for Data Mining (CRISP-DM) Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment Identify project objectives Collect and review data Select and cleanse data Manipulate data and draw conclusions Evaluate model and conclusions Apply conclusions to business From 2000, 77 pages
  • 16. Historical Notes • The stages are basically the same no matter who invents or reinvents the (knowledge discovery / data mining / big data / data science) process. You may not always need all the stages. • Data science is an iterative process. – Backwards arrows on most process diagrams.
  • 17. Example of the Knowledge Discovery Process
  • 18. Knowledge Discovery Process Example • I’ll walk you through the knowledge discovery process with an example – the process of predicting power failures in Manhattan.
  • 19. Motivation for Example • In NYC the peak demand for electricity is rising. • The infrastructure dates back to the 1880’s from the time of Thomas Edison. • Power failures occur fairly often (enough to do statistics) and are expensive to repair • We want to determine how to prioritize manhole inspections in order to reduce the number of manhole events (fires, explosions, outages) in the future. • This is a real problem.
  • 20. • Opportunity Assessment & Business Understanding • Data Understanding & Data Acquisition • Data Preparation, including Cleaning and Transformation • Model Building • Policy Construction • Evaluation, Residuals and Metrics • Model Deployment, Monitoring, Model Updates Stages in the knowledge discovery process
  • 21. Opportunity Assessment & Business Understanding What do you really want to accomplish and what are the constraints? What are the risks? How will you evaluate the quality of the results? • For manhole events the general goal was to “predict manhole fires and explosions before they occur.” We made it more precise: – Goal 1: Assess predictive accuracy for predicting manhole events in the year before they happen. – Goal 2: Create a cost-benefit analysis for inspection policies that takes into account the cost of inspections and manhole fires. Determine how often manholes need to be inspected.
  • 22. Data Understanding & Data Acquisition Data were: –Trouble tickets – free text documents typed by dispatchers documenting problems on the electrical grid. –Records of information about manholes –Records of information about underground cables –Electrical shock information tables –Extra information about serious events –Inspection reports –Vented cover data V’s of Big Data include “Variety” and “Veracity” How do you know what data to trust?
  • 23. Data Cleaning and Transformation • Sometimes 99% of the work
  • 24. Data Cleaning and Transformation • Turn free text into structured information: – Trouble tickets turned into a vector like: • Serious / Less Serious / Not an Event • Year • Month • Day • Manholes involved • … • Try to integrate tables (create unique identifiers): – If you join manholes to cables, half of the cable records disappear
  • 26. Model Building • Often predictive modeling, meaning machine learning or statistical modeling • If you want to answer a yes/no question, this is classification. – For manholes, will the manhole explode next year? Y/N • If you want to predict a numerical value, this is regression. • If you want to group observations into similar-looking groups, this is clustering. • If you want to recommend someone an item (e.g., book/movie/product) based on ratings data from customers, this is a recommender system. • Note: There are many other machine learning problems.
  • 27. Policy Construction • How will your model be used to change policy? – For manholes, how should we recommend changing the inspection policy based on our model? – Consider using social media and customer purchase data to determine customer participation if Starbucks moves into New City. After the model is created, how to optimize where the shops are located, how big they are, and where the warehouses are located. • Model building is predictive, Policy Construction is prescriptive.
  • 28. Evaluation • How do you measure the quality of the result? Evaluation can be difficult if the data do not provide ground truth. • For manhole events, we had engineers at Con Edison withhold high quality recent data and conduct a blind test.
  • 29. Deployment • Getting a working proof of concept deployed stops 95% percent of projects. • Don’t bother doing the project in the first place if no one plans to deploy it.* • Keep a realistic timeline in mind. Then add several months. • While the model is deployed it will need to be updated and improved. * Unless it’s fun.
  • 30. Knowledge Discovery is an Iterative Process
  • 31. Summary • Several attempts to make the process of discovering knowledge scientific – KDD, CRISP-DM,CCC Big Data Pipeline • All have very similar steps – Data Mining is only one of those steps (but an important one)
  • 32. Stephen F Elston| Principle Consultant, Quantia Analytics, LLC
  • 33. • Overview of data science technology • Tour of Azure ML Studio • Azure ML Experiments • Azure tables and data types • R and Python for Azure ML • SQL for Azure ML • Machine learning example Outline
  • 34. Tools for Data Science • Open source tools: R and Python • Azure Machine Learning – Easy to use – Easy to deploy • Data science stacks
  • 35. Why Open-Source Tools? • R and Python widely used in data science • Highly interactive • Good visualization • Vast packages (libraries) of utilities and algorithms • Excellent development environments • Jupyter notebooks
  • 36. • R and Python are widely used in data science • Powerful open-source data science tools • Both have advantages and disadvantages • Python tends to be more systematic and faster • R contains wider range of packages and capabilities • R offers grammar of graphics R or Python?
  • 37. Cortana Intelligence Suite Intelligence Dashboards & Visualizations Information Management Big Data Stores Machine Learning and Analytics Cortana Event Hubs HDInsight (Hadoop and Spark) Stream Analytics Data Intelligence Action People Automated Systems Apps Web Mobile Bots Bot Framework SQL Data Warehouse Data Catalog Data Lake Analytics Data Factory Machine Learning Data Lake Store Cognitive Services Power BI Data Sources Apps Sensors and devices Data
  • 38.
  • 39. Why Azure ML? • Easy to use • Quickly deploy production solutions as web services • Models run in a highly scalable cloud environment • Secure cloud environment for data and code • Powerful, efficient built-in algorithms • Extensible with, SQL, Python, and R • Integration with Jupyter notebooks
  • 40. Azure ML Free Tier Account • Free Tier Account • Unlimited time, with restricted priority • Paid account provides full performance
  • 41. • Experiments contain workflow • Experiments constructed of modules • Experiments in sharable workspace • Modules transform data, compute models, score models, and evaluate models • Create custom modules with SQL, R and Python • Deploy solutions as web services Azure ML Studio
  • 42. Data Passed from Module to Module in Azure ML Tables Module Module Col1, Col2, Col3, ………………………,ColN Val11, Val12, Val13, ……………………,Val1N ValM1, ValM2, ValM3, ………………, ValMN ………., ……….., ………., ……………………., ……… Rectangular table N Columns M Rows Equal length columns
  • 43. Azure ML Table Data Types • Numeric; Floating Point • Numeric: Integer • Boolean • String • Categorical • Date-time • Time-Span • Image
  • 44.
  • 45. • Azure ML is a production environment • Interactively develop and test in IDE • Subset data as needed – download as .csv • IDE has powerful editor and debugger • Cut and paste code into Execute R/Python Script module to test in Azure ML • Jupyter notebooks Developing and testing R and Python
  • 46. Execute R Script Azure ML Table R Device Port myFrame <- maml.mapInputPort(1,2) zip file source("src/myScript.R") maml.mapOutputPort(“myFrame") print(“Hello world")
  • 47. Execute Python Script Azure ML Table Python Device Port def azureml_main(inFrame1, inFrame2) zip file import my_package return myFrame print(“Hello world")
  • 48. • Code tested in IDE should run in Azure ML, but……. • If error occurs look at the error.log or output.log • From R use print() function • From Python use sys.stderr.write() from sys Debugging R and Python in Azure ML
  • 49. SQL in Azure ML • SQL data I/O; Reader and Writer modules • SQL transformation module • SQL resources Querying with Transact-SQL, Graeme Malcom and Geoff Allix, https://www.edx.org/course/querying-transact-sql-microsoft- dat201x-0
  • 50.
  • 51. Formally, given training set (xi,yi) for i=1…n, we want to create a classification model f that can predict label y for a new feature values x. Classification Feature 1 Feature 2 0 f(x)=0 f(x)<0 f(x)>0 f(x) = function(features) L1 L2
  • 52. Machine learning workflow Input data Data Transformation Train Model Define Model Split Data Score (prediction) Evaluate Model
  • 53. ©2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Editor's Notes

  1. 1
  2. 7
  3. 9
  4. Let’s start with historical notes. *Read* So it’s actually kind of an old term that got a bit hyped up a few years ago after the CCC whitepapers on Big Data came out. I’m showing you the CCC big data pipeline from the whitepaper in 2012. It’s nice, it’s an effort to formalize the scientific process that goes into discovering knowledge from data. It’s not just data mining, there’s a lot more to it. In fact, data mining is the 4th step out of 5. Major steps in analysis of big data are shown in the flow at the top. Below it are big data needs that make these tasks challenging. Let me go through the 5 steps. First you have to go through a process of actually obtaining the data. Perhaps that involves recording how users behave on different websites, so you have to write a script to collect the data properly. The second step is extraction/cleaning/annotation, where you try to reduce the noise a bit and get rid of the data you don’t directly need. Or if you’re working with free text that’s easy to annotate, you might be able to turn it into a structured table. Basically at this step, you’re turning the data from what is potentially a pile of rubbish into something you might actually be able to work with. At the third step you are going to set up the data in a way that is directly conducive to data mining. All the text documents are linke properly to the other structured tables, and you have one central database where the information is stored neatly. : Ok so that’s the big data pipeline. Now I’m going to show you another, separate big data pipeline, constructed by a separate group of people, at another time. So this is a second group of people trying to formalize the discovery of knowledge from data. Ready?
  5. Way before the CCC had “big data” as a twinkling in their eyes. And the CCC people did not cite the original KDD paper – they didn’t know about each other. Perhaps there is something fundmental here about this process. Maybe it’s like an important number, like pi or the golden ratio or something. This sort of universal process that people separately discover. Anyway, so this is the kdd process.
  6. selection, preprocessing, transformation. Does it look familiar? It should, because it’s the same as on the previous slide! Same exact steps. So you’re probably not surprised, so what, two groups came up with the same diagram. But look. This diagram is from 1996.
  7. It seems so abstract but there may be something fundamental about this. BTW reading both of these articles is really inspiring. One thing I want to point out again is that data mining is a part of the process.
  8. One part, not all of it. Granted, it’s like the climax of a story, but like any story, the set up of the story really makes the whole thing work. *pause*
  9. Now, KDD wasn’t the only game in town before the CCC whitepapers. This is CRISP-DM – Cross.... These guys were really serious about formalizing the process. They wrote a 77 page manual on how to do this stuff. I actually like CRISP-DMs process the best, because it includes business understanding and data understanding. I also love all the backwards arrows showing how many iterations you do within this process on a regular basis. Anyway, this is the process I tend to think of.
  10. Just to summarize
  11. 17
  12. Just to summarize
  13. See CRISP-DM for the 77 page description
  14. *read* These goals are concrete – we can form concrete assessments of how well we did with these tasks.
  15. Cables were not matched to manholes
  16. Before this step you might have a giant pile of garbage.
  17. I like to think of the knowledge discovery process sort of like alchemy. Alchemy is where you try to turn lead into gold. Except that alchemy doesn’t work, but data often does! Business understanding is like – am I really going to get gold out of this? What constitutes gold here Data understanding is like – what are my raw materials that I’m going to collect Data prep is an important step – that’s where you purify all the raw ingredients and get rid of all the junk and contaminants. Modeling is where you actually do the transformation of the raw ingredients into gold – that’s why people are obsessed with it, and why other people think it’s magic, but really, as long as you did the processing right, this step is generally pretty easy. And it does work like magic. But you’ll see it’s definitely not magic! Evaluation is where the appraiser comes and tells you how much the gold is worth, and of course deployment is where the whole village comes to buy your gold. Back to the modeling step.
  18. Predictive doesn’t necessarily mean in the future time-wise. It could mean prediction of a new circumstance you haven’t seen before.
  19. That’s how they knew they could trust us.
  20. It’s like “Hey I turned lead into gold” and the reaction is “well, that’s nice” Goals might change, data itself might change, data might change as a reaction to the new policy, you observe something that you didn’t expect and need to put it in, etc.
  21. *ad lib about iterative process* At least if you have this outline, it sort of helps you figure out what to charge people for your services. It’s not like you can just charge them for doing some modeling, you actually have to charge them for all parts of this knowledge discovery process and for the iterations you’re going to do to fix up the system. A proposal to work might actually include each of these steps in the process.
  22. 32
  23. T: Cortana Intelligence provides everything you need to transform your organization’s data into intelligent action. Next, let’s take a look at another demo.
  24. 38
  25. 44
  26. 50