Uploaded by The Statistical and Applied Mathematical Sciences Institute


NCCU: The Story of Data Science and Machine Learning Workshop - A Tutorial in Data Science and Machine Learning - Sudipta Dasmohapatra, March 4, 2019

This document provides an introduction to data science and machine learning. It discusses the era of big data and jobs in data science. It defines data science as an interdisciplinary field at the interface of statistics, computer science, and mathematics. The document describes supervised and unsupervised machine learning algorithms such as linear regression, decision trees, support vector machines, and clustering. It also discusses applications of machine learning like recommendations and content filtering. Deep learning and neural networks are explained with examples of image analysis. Common data issues are identified such as data integration, quality, and value.


Department of Statistical Science
BOX 90251, DURHAM, NC 27708-0251
(919) 684-4210, WWW.STAT.DUKE.EDU
Spring (March 4 2019)
Sudipta Dasmohapatra (sd345@duke.edu)
Introduction to Data Science
and Machine Learning

Do you recognize this Company?

What is Data Science?

The Era of Big Data
• Digital data and ecommerce
• Online transactions
• Social media data
• Financial data
• Retail and other data
• Etc.
Volume, Velocity and Variety

Jobs in Data Science
• 25 best jobs of 2019 (US News)
https://money.usnews.com/money/careers/slideshows/the-25-best-jobs?slide=27
…

Google Trends (US)

Google Trends (Data Science)

What is Data Science?
• Data Science is an area at the interface of
statistics, computer science, and mathematics
• Statisticians contributed a large inferential framework,
important Bayesian perspectives, the bootstrap and
CART and random forests, and the concepts of
sparsity and parsimony
• Computer scientists pioneered neural networks,
boosting, PAC bounds, and developed frameworks
such as Spark and Hadoop for handling
Big Data
• Mathematicians contributed support vector machines,
modern optimization, tensor analysis and (maybe)
topological data analysis

What is Data Science?
• Data Science tries to find
hidden structure in large, high
dimensional datasets. But
there is significant variance in
the interpretability of results
• Interesting structure can arise
in regression analysis,
discriminant analysis, cluster
analysis, or more exotic
situations, such as
multidimensional scaling

What is Data Science?

Visualizations
• https://bost.ocks.org/mike/nations/
• https://observablehq.com/@d3/sankey-diagram
• http://bl.ocks.org/nbremer/94db779237655907b907

Machine Learning
• Machine learning is an application of artificial intelligence (AI)
that provides systems the ability to automatically learn and
improve from experience without being explicitly programmed
• Machine learning focuses on the development of computer
programs that can access data and use it to learn for
themselves
https://towardsdatascience.com/machine-learning-65dbd95f1603

Applications of ML
• Google Photos: To recognize faces,
emotions, location, etc.
• Google Gmail: Content modeling
• Youtube: Improve search results
• Amazon: Product recommendations
• Facebook: Rank and personalize News
Feed stories, filtering out offensive
content, highlighting trending topics,
ranking search results, and recognizing
image and video content
• Uber: UberEATS to estimate time to
deliver food
hiphotos35/Getty Images/iStockphoto

Machine Learning Algorithms
• Supervised Learning: Response

Multiple Linear Regression (MLR)
• Models with more than one predictor
variable are called multiple regression
models.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
Independent variables: $x_1, x_2, \dots, x_k$
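The regression equation above can be fitted by ordinary least squares. The following is a minimal sketch, assuming a tiny made-up dataset generated from y = 1 + 2·x1 + 3·x2 with no noise; the function names (`solve_linear_system`, `fit_mlr`, `predict`) are illustrative and not part of the original slides.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_mlr(rows, y):
    """Estimate [beta_0, beta_1, ..., beta_k] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(beta, row):
    """Evaluate beta_0 + beta_1 * x_1 + ... + beta_k * x_k."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Hypothetical data: two predictors, response built from known coefficients.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta = fit_mlr(rows, y)
print([round(b, 6) for b in beta])
```

Because the toy data contain no error term, the fit should recover the generating coefficients up to floating-point rounding; with real data the estimates would only approximate the underlying relationship.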

Machine Learning Algorithms
• Supervised Learning: Response

List of Common Supervised ML
Algorithms
• Linear Regression: Approach to
modelling the relationship between a
response (or dependent variable) and
one or more explanatory variables (or
independent variables)
• Nearest Neighbor: ML algorithm for
classification and prediction. The
k-nearest neighbor technique classifies
an unknown observation by computing
its distance (on the variables/features)
to previously labeled observations and
assigning the label most common
among the k closest
• Decision Trees: Decision support tool
that makes decisions based on a
tree-like structure to classify possible
outcomes
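The k-nearest neighbor idea from the list above can be sketched in a few lines. This is an illustrative example only: the training points, the labels "A" and "B", and the function name `knn_classify` are assumptions made for the sketch.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled observations: (features, label).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

print(knn_classify(train, (1.1, 1.0)))
print(knn_classify(train, (5.0, 4.8)))
```

Euclidean distance is used here for simplicity; other distance measures are common in practice, and features on different scales are usually standardized first.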

List of Common Supervised ML
Algorithms
• Support Vector Machines:
The goal of SVM is to find
the hyperplane (a line in two
dimensions) that separates the
two classes with the widest margin
• Neural Networks: Neural
Networks (NN) are a class
of machine learning
techniques that are
modeled loosely after the
human brain, to recognize
patterns in the data

Machine Learning Algorithms
• Un-Supervised Learning: No Response
• Data analysis without a right answer
• You don’t have an outcome variable you are
seeking to fit or otherwise predict
• Often best applied as exploratory analysis en
route to predictive modeling

List of Common Unsupervised
ML Algorithms
• Cluster Analysis: A
clustering problem is
where you want to
discover the inherent
groupings in the data,
such as grouping
customers by
purchasing behavior
• Association Analysis: An
association rule learning
problem is where you
want to discover rules
that describe large
portions of your data,
such as people that buy
X also tend to buy Y
Data of arrests per 100,000 residents for
assault, murder, and rape in each of the 50 US
states in 1973
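Cluster analysis as described above can be illustrated with k-means, one common clustering method. This is a minimal sketch under stated assumptions: the customer data (visits per month, average basket value), the starting centroids, and the function names are hypothetical, and the slides do not say which clustering algorithm was used on the arrests data.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroid means."""
    centroids = list(centroids)
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 10), (2, 12), (1, 9), (8, 60), (9, 65), (10, 58)]
centroids, labels = kmeans(customers, [customers[0], customers[3]])
print(labels)
```

No response variable appears anywhere in the code, which is what makes the task unsupervised: the groups are discovered from the features alone.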

Machine Learning
Remember that machine learning only works if the problem is actually solvable
with the data that you have.

Traditional Modeling to Deep
Learning
Source: Cook, 2019, houseofbots.com

Deep Learning?
• A method that makes predictions using a
sequence of non-linear processing stages
• The resulting intermediate representations can
be interpreted as feature hierarchies and the
whole system is jointly learned from data
• Deep learning is a new way of fitting neural
networks
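The "sequence of non-linear processing stages" can be made concrete with a forward pass through a small network. This sketch assumes arbitrary, hand-picked weights (a trained network would learn them jointly from data), and all names are illustrative.

```python
import math


def affine(weights, bias, v):
    """Compute weights @ v + bias for one layer."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def relu(v):
    """Non-linearity applied after each hidden layer."""
    return [max(0.0, x) for x in v]


def sigmoid(x):
    """Squash the final score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 hidden units -> 1 output.
W1, b1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]
W2, b2 = [[0.6, -0.1, 0.2], [-0.4, 0.3, 0.5]], [0.05, 0.0]
W3, b3 = [[0.7, -0.5]], [0.1]


def forward(x):
    """Each stage feeds its intermediate representation to the next one."""
    h1 = relu(affine(W1, b1, x))
    h2 = relu(affine(W2, b2, h1))
    out = sigmoid(affine(W3, b3, h2)[0])
    return h1, h2, out


h1, h2, out = forward([1.0, 2.0])
print(len(h1), len(h2), out)
```

The intermediate vectors `h1` and `h2` play the role of the feature hierarchy mentioned above: each is computed from the previous stage rather than designed by hand.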

Image Analysis and Deep
Learning
• Images take a lot of space, so
image compression is important
• Image segmentation is the
process of partitioning a
digital image into multiple
segments (sets of pixels)
• Image segmentation is
typically used to locate
objects and boundaries (lines,
curves, etc.) in images.
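Partitioning an image into sets of pixels can be sketched with a simple classical approach: threshold the image, then group bright pixels that touch each other. This is an illustrative sketch, not the deep-learning method the slides go on to discuss; the tiny grayscale image, the threshold, and the function name `segment` are assumptions.

```python
from collections import deque


def segment(image, threshold):
    """Label connected regions of pixels brighter than the threshold.

    Returns a label grid (0 = background) and the number of segments found.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] > threshold and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                # Breadth-first search over 4-connected bright neighbors.
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] > threshold
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Hypothetical 4x5 grayscale image containing two bright objects.
image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 9, 0, 7],
    [0, 0, 0, 0, 7],
]
labels, count = segment(image, threshold=5)
print(count)
```

Each non-zero label marks one segment, and the border between a labeled region and the background gives the object boundary.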

Image Preprocessing and
Segmentation

Deep Learning?
Source: Deshpande, https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

Neural Networks Basic
Architecture

Deep Learning?
Source: Deshpande, https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

A Kernel
Input Layer, Hidden Layer
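In a convolutional network, a kernel is a small grid of weights that slides over the input layer, and each position produces one value in the hidden layer. The sketch below assumes a hand-written vertical-edge kernel and a toy image; in a real network the kernel weights are learned. As in most deep learning libraries, the operation shown is cross-correlation (the kernel is not flipped).

```python
def apply_kernel(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 3x5 image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# Hypothetical kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = apply_kernel(image, kernel)
print(feature_map)
```

The output is zero over the flat dark region and positive where the window overlaps the edge, which is the sense in which a kernel "detects" a feature.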

Visualization of the First Layer

Multiple Applications

Common Data Issues and
Problems
• Too much data across multiple sources
– no consolidation or integration
• Turf war over who owns the data
• Issues of data quality
• Understanding the low-hanging fruit
(data exploration and management,
standardization, visualization)
• Understanding value of data across
organization
• …

Questions?


Editor's Notes

#3 Can you identify which company this could be? There are a couple of things here. Notice that we are buying something. The page shows recommendations and lists of what other customers are buying. Data science is used to generate this list.
#5 This is the era of big data: we have data coming from everywhere, which means that we need the resources and skills to analyze these data. Imagine data from all these sources and in all these industries. Small companies as well as big ones realize that there is value in looking at and evaluating data. Big Data is any data that is expensive to manage and hard to extract value from.
• Volume: the size of the data
• Velocity: the latency of data processing relative to the growing demand for interactivity
• Variety and complexity: the diversity of sources, formats, quality, and structures
#6 Both 1 and 2 are very closely related to data science (computer science + statistics + Math)
#9 There are some cultural differences. A key concept in data science is sparsity, which is closely related to parsimony and regularization. One wants the simplest possible model that is adequate to one’s purpose. This often implies that the model is parsimonious (containing only a few terms), and this may be achieved by regularization (e.g., forcing terms with small coefficients to zero). Sparsity is essentially Ockham’s Razor, and is a key idea in all inferential paradigms. It takes many forms.
CART: classification and regression trees.
PAC: probably approximately correct learning.
“Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides people the tools to update their beliefs in the evidence of new data.” You got that? Let me explain it with an example. Suppose that, out of 4 championship races (F1) between Niki Lauda and James Hunt, Niki won 3 times while James managed only 1. If you were to bet on the winner of the next race, who would it be? I bet you would say Niki Lauda. Here’s the twist. What if you are told that it rained once when James won and once when Niki won, and that it will definitely rain on the date of the next race? Who would you bet your money on now? By intuition, it is easy to see that James’s chances of winning have increased drastically. But the question is: how much? Substituting the values into the conditional probability formula, P(James wins | rain) = P(rain and James wins) / P(rain) = (1/4) / (2/4), we get a probability of 50%, which is double the 25% obtained when rain was not taken into account (solve it at your end). The new evidence, i.e. rain, strengthens our belief that James will win. This formula bears a close resemblance to something you might have heard a lot about: Bayes’ theorem. Bayes’ theorem is built on top of conditional probability and lies at the heart of Bayesian inference.
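The arithmetic in the racing example can be checked directly. This is a small sketch that encodes the four races exactly as the note describes them (Niki wins three, one of them in rain; James wins one, in rain); the helper name `probability` is illustrative.

```python
from fractions import Fraction

# The four races from the example: who won and whether it rained.
races = [
    {"winner": "Niki", "rain": False},
    {"winner": "Niki", "rain": False},
    {"winner": "Niki", "rain": True},
    {"winner": "James", "rain": True},
]


def probability(event, given=lambda race: True):
    """Relative frequency of the event among races satisfying the condition."""
    relevant = [race for race in races if given(race)]
    return Fraction(sum(1 for race in relevant if event(race)), len(relevant))


def james_wins(race):
    return race["winner"] == "James"


def rained(race):
    return race["rain"]


prior = probability(james_wins)                       # P(James wins)
p_rain = probability(rained)                          # P(rain)
p_rain_given_james = probability(rained, given=james_wins)

# Bayes' theorem: P(James | rain) = P(rain | James) * P(James) / P(rain)
posterior = p_rain_given_james * prior / p_rain
print(prior, posterior)
```

Counting the rainy races directly gives the same answer as Bayes' theorem, which is the point of the note: the theorem is a rearrangement of conditional probability.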
#10 Statisticians tend to favor interpretation, whereas computer scientists often prefer black box models with good accuracy and broad applicability.
#13 Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data. For example, one kind of algorithm is a classification algorithm. It can put data into different groups. The same classification algorithm used to recognize handwritten numbers could also be used to classify emails into spam and not-spam without changing a line of code. It’s the same algorithm but it’s fed different training data so it comes up with different classification logic.
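To make the “same algorithm, different training data” idea concrete, here is a minimal sketch using a nearest-centroid classifier. The algorithm choice, the toy features, and the names `train` and `classify` are all assumptions made for illustration; the notes do not say which classifier is meant.

```python
def train(examples):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def classify(centroids, features):
    """Return the label whose centroid is closest to the features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Toy "digit" data: (fraction of dark pixels, number of closed loops).
digit_examples = [((0.10, 0), "1"), ((0.12, 0), "1"), ((0.30, 2), "8"), ((0.28, 2), "8")]
# Toy "email" data: (count of the word "free", number of links).
email_examples = [((5, 8), "spam"), ((4, 6), "spam"), ((0, 1), "not-spam"), ((1, 0), "not-spam")]

# The same two functions are reused; only the training data changes.
digit_model = train(digit_examples)
email_model = train(email_examples)

print(classify(digit_model, (0.29, 2)))
print(classify(email_model, (0, 2)))
```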
#14 Machine learning is widely used nowadays. Some examples that we use on a daily basis: Facebook has used machine learning for ranking and personalizing News Feed stories, filtering out offensive content, highlighting trending topics, ranking search results, and recognizing image and video content. Google uses machine learning in almost every product. Photos: uses machine learning to recognize faces, locations, emotions, etc. Gmail: analyzes the content of an email and suggests smart replies. YouTube: uses machine learning to improve search results. Previously, search relied on the meta tags and text provided by the content creator, but now it analyzes the video content itself and offers the best content to the user. Amazon uses machine learning for product recommendations. Uber uses machine learning in UberEATS to estimate food delivery times.
#15 “Machine learning” is an umbrella term covering lots of these kinds of generic algorithms. Supervised Learning: Let’s say you are a real estate agent. Your business is growing, so you hire a bunch of new trainee agents to help you out. But there’s a problem: you can glance at a house and have a pretty good idea of what it is worth, but your trainees don’t have your experience, so they don’t know how to price their houses. To help your trainees (and maybe free yourself up for a vacation), you decide to write a little app that can estimate the value of a house in your area based on its size, neighborhood, etc., and what similar houses have sold for. So, for 3 months, you write down every time someone sells a house in your city. For each house, you write down a bunch of details: number of bedrooms, size in square feet, neighborhood, etc. But most importantly, you write down the final sale price. Using that training data, we want to create a program that can estimate how much any other house in your area is worth. This is called supervised learning. You knew how much each house sold for, so in other words, you knew the answer to the problem and could work backwards from there to figure out the logic. To build your app, you feed your training data about each house into your machine learning algorithm. The algorithm is trying to figure out what kind of math needs to be done to make the numbers work out. In supervised learning, you are letting the computer work out that relationship for you. And once you know what math was required to solve this specific set of problems, you could answer any other problem of the same type!
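As a rough sketch of the house-price app, the code below learns a straight-line relationship between size and sale price from a handful of past sales. The sales figures are made up, and using a single feature with closed-form least squares is a simplifying assumption, not necessarily what the app in the story would do.

```python
# Hypothetical training data: (size in square feet, sale price in dollars).
sales = [(1000, 200_000), (1500, 290_000), (2000, 410_000), (2500, 500_000)]

def fit_line(points):
    """Closed-form least squares for price = intercept + slope * size."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# "Training": the computer works out the math from the known answers.
intercept, slope = fit_line(sales)

def estimate_price(size):
    """Apply the learned relationship to a house that was not in the data."""
    return intercept + slope * size

print(round(estimate_price(1800)))
```

The known sale prices are what make this supervised: the fit is judged by how close its estimates come to answers we already have.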
#17 The method for finding the line of best fit in multiple linear regression is exactly the same as in simple linear regression: the least squares method. The only thing that has changed is the predicted value of the response, $\hat{y}_i$.
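Written out, the comparison looks like this. The coefficient names $b_0, \dots, b_p$ and the double-index notation $x_{ij}$ are assumed here for illustration; the slide may use different symbols.

```latex
\begin{align*}
\text{simple:}\quad & \hat{y}_i = b_0 + b_1 x_i \\
\text{multiple:}\quad & \hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} \\
\text{both minimize:}\quad & \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\end{align*}
```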
#21 Let’s go back to our original example with the real estate agent. What if you didn’t know the sale price for each house? Even if all you know is the size, location, etc. of each house, it turns out you can still do some really cool stuff. This is called unsupervised learning. This is kind of like someone giving you a list of numbers on a sheet of paper and saying “I don’t really know what these numbers mean, but maybe you can figure out if there is a pattern or grouping or something. Good luck!” So unsupervised learning is a broad term encompassing data analysis without a right answer. So what could you do with this data? For starters, you could have an algorithm that automatically identifies different market segments in your data. Maybe you’d find out that home buyers in the neighborhood near the local college really like small houses with lots of bedrooms, but home buyers in the suburbs prefer 3-bedroom houses with lots of square footage. Knowing about these different kinds of customers could help direct your marketing efforts. Another cool thing you could do is automatically identify any outlier houses that are way different from everything else. Maybe those outlier houses are giant mansions, and you can focus your best salespeople on those areas because they bring bigger commissions.
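One way to find market segments without any sale prices is clustering. The sketch below uses plain k-means, which is an assumption on my part (the notes do not name an algorithm), with made-up house data and hand-picked starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical houses: (size in hundreds of square feet, bedrooms). No prices.
houses = [(9, 4), (10, 5), (11, 4), (24, 3), (26, 3), (25, 3)]

# Starting centroids are picked by hand here; real code would randomize them.
centroids, clusters = kmeans(houses, centroids=[(9, 4), (26, 3)])
print(clusters)
```

With this toy data the two groups are small houses with many bedrooms and large 3-bedroom houses, loosely mirroring the college-area and suburban segments described above.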
#23 But it’s important to remember that machine learning only works if the problem is actually solvable with the data that you have. For example, if you build a model that predicts home prices based on the type of potted plants in each house, it’s never going to work. There just isn’t any kind of relationship between the potted plants in each house and the home’s sale price. So no matter how hard it tries, the computer can never deduce a relationship between the two. So remember, if a human expert couldn’t use the data to solve the problem manually, a computer probably won’t be able to either. Instead, focus on problems where a human could solve the problem, but where it would be great if a computer could solve it much more quickly. https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471 https://medium.com/@ageitgey/machine-learning-is-fun-part-2-a26a10b68df3
#25 Are you tired of reading endless news stories about deep learning and not really knowing what that means? Let’s change that!
#26 Slide 4 Images are very large: you can imagine that a dataset with 10 images could be 100 megabytes. What happens when you have 1,000 or 10,000 images? We’re working with color images, each with dimensions x, y, z, where x and y are specific to each photo. Image files, insofar as a computer understands them, are three layers of matrices stacked on top of each other, with each pixel being an individual entry in each matrix. So, to begin with, we use image algorithms to compress these images for processing. Many times, we grayscale and resize images so they’re smaller to work with. In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as super-pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics. You can see from this slide that the image of the traffic and surrounding view in the top figure is pixelated and then classified into segments, which are represented as various colors in the bottom figure.
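The grayscale-and-resize step can be sketched on a tiny image stored as nested lists. The luminance weights (0.299, 0.587, 0.114) are a common convention assumed here, and averaging blocks of pixels is just one simple way to resize; neither is stated in the notes.

```python
# A tiny 2x4 color image: rows of (red, green, blue) pixels, values 0-255.
image = [
    [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 255)],
    [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 255)],
]

def to_grayscale(rgb_image):
    """Collapse three channels into one using common luminance weights."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
        for row in rgb_image
    ]

def downsample(gray_image, factor=2):
    """Shrink the image by averaging each factor-by-factor block of pixels."""
    height, width = len(gray_image), len(gray_image[0])
    result = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            block = [
                gray_image[top + dy][left + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result

gray = to_grayscale(image)
small = downsample(gray)
print(len(image) * len(image[0]) * 3, "values ->", len(small) * len(small[0]), "values")
```

Going from three channels to one and halving each dimension cuts the number of stored values by a factor of twelve, which is why this preprocessing is so common.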
#27 Slide 5 This process can be clarified further by looking at another example. Image classification is the process of taking an input image and outputting a class label from a set of categories. So, for example, if we know that our data consists of images of dogs, cats, birds, etc., we first classify one image as a dog using a model trained on all the other images. In localization, however, our job is not only to produce a class label but also a bounding box that describes where the object is in the picture. We also have the task of object detection, where localization needs to be done on all of the objects in the image. Therefore, you will have multiple bounding boxes and multiple class labels. Finally, in segmentation the task is to output a class label as well as an outline of every object in the input image.
#28 Any 3-year-old child can recognize a photo of a bird, but figuring out how to make a computer recognize objects has puzzled the very best computer scientists for over 50 years. In the last few years, we’ve finally found a good approach to object recognition using deep convolutional neural networks. That sounds like a bunch of made-up words from a William Gibson sci-fi novel, but the ideas are totally understandable if you break them down one by one.
#29 A NN typically contains one input layer, one or more hidden layers, and an output layer. The input layer consists of your p predictors, or input units / nodes. It is generally good practice to center, scale and transform predictors, if only to speed up the optimization procedure. These input units can be connected to one or more hidden units in the first hidden layer. A hidden layer that is fully connected to the preceding layer is designated dense. In the diagram, both hidden layers are dense. The output layer computes the prediction, and the number of units therein is determined by the problem at hand. Conventionally, a binary classification problem requires a single output unit (as shown in the diagram), whereas a multiclass problem with k classes requires k corresponding output units. The former can simply use a sigmoid function to directly compute a probability, while the latter usually requires a softmax transformation, whereby the values across all k output units sum to one and can thus be treated as probabilities. Rather than settling for categorical predictions, you can retrieve the actual probabilities, which are much more informative, and inspect their quality using calibration plots and lift charts. Every arrow displayed in the diagram passes on an input that is associated with a weight. Each weight is essentially one of many coefficient estimates that contribute to the regressions computed in the nodes that the corresponding arrows point to. These are unknown parameters that must be tuned by the model so as to minimize the loss function, using an optimization procedure. In effect, for any particular observation each neuron can be mathematically represented as the equation that you see here. In this equation b denotes the intercept (also known as bias, and technically a weight itself) and W and x are m-long vectors carrying the weights and values from all m inputs, respectively.
Before training, all weights are initialized with random values.
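The neuron equation itself is not reproduced in these notes, so the sketch below assumes the usual form, output = f(b + W · x), with a sigmoid as the activation f. The function names, the example inputs, and the uniform random initialization are all illustrative choices.

```python
import math
import random

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs, activation=sigmoid):
    """One unit: activation(b + W . x)."""
    return activation(bias + sum(w * x for w, x in zip(weights, inputs)))

def softmax(scores):
    """Turn k raw output scores into k probabilities that sum to one."""
    largest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)  # fixed seed so the random initial weights are reproducible
m = 3
weights = [random.uniform(-1, 1) for _ in range(m)]
bias = random.uniform(-1, 1)
x = [0.5, -1.2, 2.0]  # one observation with m centered and scaled predictors

binary_probability = neuron(weights, bias, x)   # single sigmoid output unit
class_probabilities = softmax([2.0, 1.0, 0.1])  # k = 3 output units
print(binary_probability, class_probabilities)
```

Training would then adjust `weights` and `bias` to reduce the loss; that optimization step is left out here.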
#31 Slide 31 To give you a flavor of some kernels, let’s look at an example of a kernel we use for blurring. When you blur a photo in Photoshop, it does something like this: for each pixel, it takes a weighted average of the pixels around it. So the pixel in the center gets a weight of 41, the pixels closest to it get a weight of 26, and the pixels at the edges get a weight of 1, so they still have some influence. On the other hand, you may want to emphasize edges, which means emphasizing contrast. A kernel that detects contrast takes differences into account. So the weight in the middle may be zero, with negative weights at the top and positive weights at the bottom. You could try these yourself on a photograph and see that the kernel picks out those edges for you. Convolutional neural networks develop many different kernels, and those kernels are themselves learned. So you may say we need blurring, or we need features like contrast, but you don’t have to figure out these features yourself. As part of this complex model-fitting process, working all the way backward from the right answer, the neural network will figure out which features are needed. This is all accomplished with one fitting process. CNNs take these averages of pixels in different ways in parallel, so one detects edges, another roundness, and then they combine all the scores into a “face score” or a “car score”.
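Here is a small sketch of applying a kernel by hand. The 3x3 blur and edge kernels are simplified stand-ins chosen for illustration (the blur kernel described in the notes, with weights such as 41, 26 and 1, is larger), and the function name `convolve` is mine. As in most CNN libraries, the kernel is applied without flipping.

```python
def convolve(image, kernel):
    """Slide the kernel over the image; each output is a weighted sum of a patch."""
    k = len(kernel)
    # Divide by the kernel sum so averaging kernels keep the brightness scale.
    # Edge kernels sum to zero, so they are left unscaled (divide by 1).
    norm = sum(sum(row) for row in kernel) or 1
    out = []
    for top in range(len(image) - k + 1):
        row = []
        for left in range(len(image[0]) - k + 1):
            total = sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total / norm)
        out.append(row)
    return out

blur = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]     # center weighted most
edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]  # negatives on top, positives below

# Dark rows (0) above bright rows (9): a horizontal edge in the middle.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
]

print(convolve(image, blur))
print(convolve(image, edge))
```

The blur output ramps gradually from dark to bright, while the edge output is zero in the flat regions and large only where the dark and bright rows meet. In a CNN the kernel weights would be learned rather than written by hand.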
#32 Slide 34 Now, let’s go back to visualizing this mathematically. When we have this filter at the top left corner of the input volume, it is computing multiplications between the filter and the pixel values at that region. Now let’s take an example of an image that we want to classify, and let’s put our filter at the top left corner. Remember, what we have to do is multiply the values in the filter with the original pixel values of the image. Basically, in the input image, if there is a shape that generally resembles the curve that this filter is representing, then all of the multiplications summed together will result in a large value! Now let’s see what happens when we move our filter