Introduction to Data Science

Prithwis Mukerjee, PhD
Praxis Business School, Calcutta
prithwis mukerjee, ph.d.
Agenda
● Why data science ?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains
Volume
Data is being acquired from a variety of sources
● EFT in Banks, Credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social Media Updates
● Blogs
● Websites
Variety / Velocity
Variety
● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph Nodes
○ Social Media “friends”
○ Websites linked to each other
Velocity
Data is being generated fast and is becoming obsolete or useless equally fast
● Realtime ( or near realtime) data from sensors, cameras
● Website traffic
● Social media “trends”
So what is Big Data ?
● Volume
● Velocity
● Variety ?
A new term coined by IT vendors to push new technology like
● Map Reduce
● Hadoop
● NOSQL
A new way to
● collect
● store
● manage
● analyse
● visualise data
Big Data is like Crude Oil { not new Oil }
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos
But what about refining ?
The Science (and Art ) of Data
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos
Data Science is the refining
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impacts
● Communicating relevant business stories
Two Perspectives
[Diagram labels: Programming or “Hacking” Skills; Mathematics, Statistics Knowledge; Business Domain Knowledge; Machine Learning; Operations Research; RDBMS, ERP / BI; Data Science]
10 Things {most} Data Scientists do ...
1. Ask good questions
What is what ? We do not know ! We would like to know
2. Define, Test Hypothesis, Run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data
Statistics - World of Data
● Data comes in various types
○ Nominal - colour, gender, PIN code
○ Ordinal - scale of 1-10, {high, medium, low}
○ Interval - Dates, Temperature (Centigrade)
○ Ratio - length, weight, count
● Data comes in various structures
○ Structured data - nominal, ordinal, interval, ratio
○ Unstructured text - email, tweets, reviews
○ Images, voice prints
○ Graphs, networks - social media friendships, likes
Descriptive Statistics
● Numeric Description
○ Mean, Median, Mode
○ Quartile, Percentile
○ Variance / Standard Deviation
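The numeric descriptions listed above can be tried out in a few lines. This is a minimal sketch using Python's standard `statistics` module; the `spend` values are made up for illustration and do not come from the slides.

```python
import statistics

# Hypothetical sample: monthly spend of nine customers (made-up values).
spend = [12, 15, 15, 18, 21, 24, 30, 45, 90]

mean = statistics.mean(spend)                  # arithmetic average
median = statistics.median(spend)              # middle value of the sorted data
mode = statistics.mode(spend)                  # most frequent value
quartiles = statistics.quantiles(spend, n=4)   # three cut points: Q1, Q2, Q3
variance = statistics.variance(spend)          # sample variance (divides by n - 1)
std_dev = statistics.stdev(spend)              # sample standard deviation

print(mean, median, mode, quartiles, variance, std_dev)
```

The single large value (90) pulls the mean well above the median, which is one reason to report both.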
Statistics : The Path Ahead
● Probability, Distributions
● Testing of Hypothesis
● Regression, Testing
● Predictive Analysis
Data Mining / Machine Learning
Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data
Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection
Some definitions
Instance ( an item or record)
● an observation that is characterised by a number of attributes
○ person - with attributes like age, salary, qualification
○ sale - with product, quantity, price
Attribute
● measuring characteristics of an instance
Class
● grouping of an instance into
○ acceptable, not acceptable
○ mammal, fish, bird
Nominal
● colour, PIN code, state
Ordinal
● ranking : tall, medium, short or feedback on a scale of 1 - 10
Ratio
● length, price, duration, quantity
Interval
● date, temperature
Data Mining : Classification
Classification
● Which loan applicant will not default on the loan ?
● Which potential customer will respond to a mailer campaign ?
Classification Example
[Figure: a data table whose columns are marked categorical, categorical, continuous and class; Training Set, Learn Classifier, Model, Test Set]
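The training set / model / test set flow can be sketched in code. The slides do not name a classifier, so the 1-nearest-neighbour rule below is an assumption chosen for brevity. The loan records, the numeric attributes and the `classify` helper are all hypothetical.

```python
import math

# Hypothetical loan records: (age, income in thousands) -> class label.
training_set = [
    ((25, 20), "default"),
    ((30, 25), "default"),
    ((22, 18), "default"),
    ((45, 80), "repay"),
    ((50, 95), "repay"),
    ((38, 70), "repay"),
]
# Held-out records with known labels, used only to check the model.
test_set = [
    ((28, 22), "default"),
    ((47, 85), "repay"),
]

def classify(point, labelled_points):
    """Return the label of the training instance closest to `point`."""
    nearest = min(labelled_points, key=lambda item: math.dist(point, item[0]))
    return nearest[1]

# Predict each test instance, then compare against its true label.
predictions = [classify(x, training_set) for x, _ in test_set]
correct = sum(p == label for p, (_, label) in zip(predictions, test_set))
accuracy = correct / len(test_set)
print(predictions, accuracy)
```

Keeping the test set out of training is what makes the accuracy figure a fair estimate of how the model might behave on new applicants.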
Data Mining : Clustering
Given a set of unclassified data points, how to find a natural grouping within them
● Can we segment the market in some way that is not yet known ?
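One common way to look for a natural grouping is k-means. The slides do not prescribe an algorithm, so this is only an illustrative sketch. The customer points, the starting centroids and the `k_means` helper are assumptions.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid   # an empty cluster keeps its old centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (9, 50), (10, 55), (8, 52)]
clusters, centroids = k_means(customers, centroids=[(0, 0), (10, 60)])
print(clusters)
```

No class labels are supplied anywhere. The two segments emerge only from the distances between points, which is what separates clustering from classification.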
Example of Document Clustering
Clustering points : 3204 articles from the Los Angeles Times
Similarity Measure : How many words are common in these documents ( after excluding some common words )
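The similarity measure described above, counting shared words after dropping common ones, can be written directly. This is a minimal sketch. The stop-word list, the sample headlines and the function names are made up for illustration and are not taken from the Los Angeles Times data.

```python
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "in", "of", "and", "to", "is", "on"}

def content_words(text):
    """Lower-case the text and return its set of words minus stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def shared_word_count(doc_a, doc_b):
    """Similarity = number of distinct non-stop words the documents share."""
    return len(content_words(doc_a) & content_words(doc_b))

# Made-up headlines standing in for newspaper articles.
finance_1 = "Stock market rallies on strong bank earnings"
finance_2 = "Bank earnings lift the stock market"
sports_1 = "Local team wins the championship game"

print(shared_word_count(finance_1, finance_2))
print(shared_word_count(finance_1, sports_1))
```

A clustering algorithm would then group documents whose pairwise counts are high, so the two finance headlines would tend to land together.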
Clustering of S&P Stock Data
● Observe Stock Movements every day.
● Clustering points: Stock{UP/DOWN}
● Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
● We used association rules to quantify a similarity measure.
Regression
● Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
○ Greatly studied in statistics, neural network fields.
● Examples:
○ Predicting sales amounts of new product based on advertising expenditure.
○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
○ Time series prediction of stock market indices.
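The first example above, sales predicted from advertising expenditure, can be sketched with ordinary least squares on a single predictor. This assumes a linear model. The figures are made up and deliberately lie on the line y = 5 + 2x so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: advertising spend vs. sales, chosen to lie on y = 5 + 2x.
advertising = [1, 2, 3, 4, 5]
sales = [7, 9, 11, 13, 15]

slope, intercept = fit_line(advertising, sales)
predicted = intercept + slope * 6   # forecast sales for a spend of 6
print(slope, intercept, predicted)
```

Real data would scatter around the line, and the same formulas would then give the line with the smallest sum of squared errors.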
Data Mining : Association Rules Mining
Association Rules
● which products should be kept along with other products
● which two products should never be discounted together
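Association rules are usually scored by support and confidence. The slides do not define these measures, so the sketch below uses their standard definitions as an assumption, and the baskets are made up.

```python
# Made-up market baskets, one set of products per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Share of baskets containing the antecedent that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))        # how often the pair is bought together
print(confidence({"bread"}, {"butter"}))   # strength of the rule bread => butter
```

A rule such as bread => butter with high support and confidence suggests keeping the two products near each other, and discounting both at once may give away margin on a sale that would have happened anyway.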
Visualisation : The need to tell a story

Definitions
Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data
Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
R : Your first step into Data Science

Try out this free interactive tutorial just now
Statistical Tools

http://r4stats.com/articles/popularity/
Some Comparisons

Map Reduce
● Input : A set of (key, value) pairs
● User supplies two functions
○ Map (k,v) => List(k1,v1)
○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1,v2) pairs
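The two user-supplied functions can be illustrated with the classic word count. This is a single-process sketch of the programming model only, not the Hadoop API. The `map_reduce` driver and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for every word in a line."""
    return [(word, 1) for word in value.lower().split()]

def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: total the counts for one word."""
    return sum(values)

def map_reduce(records, map_fn, reduce_fn):
    """Run map, group intermediate pairs by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for k, v in records:
        for k1, v1 in map_fn(k, v):
            grouped[k1].append(v1)
    return {k1: reduce_fn(k1, v1s) for k1, v1s in grouped.items()}

# Input: (line number, line text) pairs; made-up text.
lines = [(1, "big data is big"), (2, "data science refines data")]
counts = map_reduce(lines, map_fn, reduce_fn)
print(counts)
```

In a distributed setting the grouping step is what the framework takes care of, while the map and reduce calls for different keys can run on different machines.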
Hadoop
A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS
1. HIVE
a. A plug-in that allows one to use SQL like queries that are converted into map-reduce jobs
2. PIG
a. A scripting language for writing long queries
3. HBASE
a. A non-relational DBMS
4. SQOOP
a. moves data to and from HDFS
Data-in-Flight

JavaScript for Data Visualisation

Business Domain
● Financial Sector
○ Risk Management, Credit Scoring
○ Predict Customer Spend
○ Stock and Investment Analysis
○ Loan approval
● Telecom Sector
○ Fraud Detection
○ Churn Prediction
● Retail and Marketing
○ Market segmentation
○ Promotional strategy
○ Market Basket Analysis
○ Trend Analysis
● Healthcare & Insurance
○ Fraud Detection
○ Drug Development
○ Medical Diagnostic Tools
Conclusion
● Why data science ?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains
Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
Thank You
Contact
Prithwis Mukerjee
Professor, Praxis Business School
prithwis@praxis.ac.in

This presentation is accessible at the blog
http://blog.yantrajaal.com
at the following URL
http://bit.ly/pm-datascience

TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 

Data Science

  • 1. Introduction to Data Science Prithwis Mukerjee, PhD Praxis Business School, Calcutta prithwis mukerjee, ph.d.
  • 2. Agenda
      ● Why data science?
      ● Techniques
          ○ Statistics
          ○ Data Mining
          ○ Visualisation
      ● Tools & Platforms
          ○ R
          ○ Hadoop / MapReduce
          ○ Real Time Systems
      ● Business Domains
  • 4. Volume: Data is being acquired from a variety of sources
      ● EFT in banks, credit card payments
      ● Cell phones
      ● Sensors attached to a variety of equipment
      ● Surveillance cameras, CCTV
      ● Social media updates
      ● Blogs
      ● Websites
  • 5. Variety / Velocity
      Variety
      ● Numeric data
      ● Structured text data
      ● Unstructured text data
      ● Images
      ● Sound and video recordings
      ● Graph nodes
          ○ Social media “friends”
          ○ Websites linked to each other
      Velocity: Data is being generated fast and is becoming obsolete or useless equally fast
      ● Realtime (or near realtime) data from sensors, cameras
      ● Website traffic
      ● Social media “trends”
  • 6. So what is Big Data?
      ● Volume
      ● Velocity
      ● Variety?
      A new term coined by IT vendors to push new technology like
      ● Map Reduce
      ● Hadoop
      ● NoSQL
      A new way to
      ● collect
      ● store
      ● manage
      ● analyse
      ● visualise data
  • 7. Big Data is like Crude Oil {not new Oil}
      Think of data as crude oil! Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos.
      But what about refining?
  • 8. The Science (and Art) of Data
      Think of data as crude oil! Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos.
      Data Science (refining)
      ● Discovering what we do not know about the data
      ● Obtaining predictive, actionable insight
      ● Creating data products that have business impact
      ● Communicating relevant business stories
  • 10. 10 Things {most} Data Scientists do ...
      1. Ask good questions: What is what? We do not know! We would like to know
      2. Define, test hypotheses, run experiments
      3. Scoop, scrape, sample business data
      4. Wrestle and tame data
      5. Play with data, discover unknowns
      6. Create models, algorithms
      7. Understand data relationships
      8. Tell the machine how to learn from the data
      9. Create data products that deliver actionable insights
      10. Tell relevant business stories from data
  • 11. Statistics - World of Data
      ● Data comes in various types
          ○ Nominal - colour, gender, PIN code
          ○ Ordinal - scale of 1-10, {high, medium, low}
          ○ Interval - dates, temperature (Centigrade)
          ○ Ratio - length, weight, count
      ● Data comes in various structures
          ○ Structured data - nominal, ordinal, interval, ratio
          ○ Unstructured text - email, tweets, reviews
          ○ Images, voice prints
          ○ Graphs, networks - social media friendships, likes
  • 12. Descriptive Statistics
      ● Numeric description
          ○ Mean, Median, Mode
          ○ Quartile, Percentile
          ○ Variance / Standard Deviation
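The numeric descriptions listed on the Descriptive Statistics slide can be tried out with Python's standard `statistics` module. This is only a minimal sketch: the `spend` sample below is made up for illustration and does not come from the presentation.

```python
import statistics

# Hypothetical sample (not from the slides): monthly spend of ten customers
spend = [12, 15, 15, 18, 20, 22, 22, 22, 30, 44]

mean = statistics.mean(spend)                 # arithmetic average
median = statistics.median(spend)             # middle value of the sorted data
mode = statistics.mode(spend)                 # most frequent value
quartiles = statistics.quantiles(spend, n=4)  # three cut points: Q1, Q2, Q3
variance = statistics.variance(spend)         # sample variance
std_dev = statistics.stdev(spend)             # sample standard deviation

print("mean:", mean, "median:", median, "mode:", mode)
print("quartiles:", quartiles)
print("variance:", variance, "standard deviation:", std_dev)
```

Note that the mean (22) is pulled above the median (21) by the single large value 44, which is one reason to look at more than one of these measures.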
  • 13. Statistics: The Path Ahead
      ● Probability, Distributions
      ● Testing of Hypothesis
      ● Regression, Testing
      ● Predictive Analysis
  • 14. Data Mining / Machine Learning
      Is the process of obtaining
      ● novel
      ● valid
      ● potentially useful
      ● understandable
      patterns in data
      Typical tasks are
      ● classification
      ● clustering
      ● association rules
      ● sequential patterns
      ● regression
      ● deviation detection
  • 15. Some definitions
      Instance (an item or record)
      ● an observation that is characterised by a number of attributes
          ○ person - with attributes like age, salary, qualification
          ○ sale - with product, quantity, price
      Attribute
      ● measuring characteristics of an instance
      Class
      ● grouping of an instance into
          ○ acceptable, not acceptable
          ○ mammal, fish, bird
      Nominal
      ● colour, PIN code, state
      Ordinal
      ● ranking: tall, medium, short or feedback on a scale of 1 - 10
      Ratio
      ● length, price, duration, quantity
      Interval
      ● date, temperature
  • 16. Data Mining: Classification
      ● Which loan applicant will not default on the loan?
      ● Which potential customer will respond to a mailer campaign?
  • 17. Classification Example
      [Figure: a Training Set whose columns are labelled categorical, categorical, continuous and class is used to Learn Classifier, giving a Model that is applied to a Test Set]
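The training set, learn classifier, model, test set flow of the Classification Example can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the loan data is invented, the two numeric attributes (age, income) replace the categorical and continuous columns on the slide, and a 1-nearest-neighbour rule stands in for whatever classifier the slide had in mind.

```python
import math

# Hypothetical training set: (age, income in thousands) -> class label
training_set = [
    ((25, 20), "default"),
    ((30, 25), "default"),
    ((22, 15), "default"),
    ((45, 80), "repay"),
    ((50, 95), "repay"),
    ((35, 60), "repay"),
]

# Hypothetical test set: labelled records held back to check the model
test_set = [
    ((28, 22), "default"),
    ((48, 90), "repay"),
]

def learn_classifier(training):
    """'Learn' a 1-nearest-neighbour model: store the examples and
    return a function that predicts the class of a new instance."""
    def predict(features):
        nearest = min(training, key=lambda example: math.dist(example[0], features))
        return nearest[1]
    return predict

model = learn_classifier(training_set)

# Apply the model to the test set and measure how often it is right
correct = sum(model(features) == label for features, label in test_set)
accuracy = correct / len(test_set)
print("accuracy on test set:", accuracy)
```

The point of keeping a separate test set is that accuracy is measured on records the model has not seen while learning.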
  • 18. Data Mining: Clustering
      Given a set of unclassified data points, how to find a natural grouping within them
      ● Can we segment the market in some way that is not yet known?
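One common way to look for a natural grouping in unlabelled points is k-means, which the slide does not name; it is used here only as an example technique. The sketch below works on one-dimensional data, and the customer spend figures and starting centres are hypothetical.

```python
def k_means_1d(points, centres, iterations=10):
    """Very small 1-D k-means: repeatedly assign each point to its nearest
    centre, then move each centre to the mean of its assigned points."""
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Keep the old centre if a cluster ends up empty
        centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres, clusters

# Hypothetical annual spend of nine customers (no segment labels given)
spend = [5, 7, 6, 40, 42, 45, 90, 95, 92]

centres, clusters = k_means_1d(spend, centres=[0, 50, 100])
print("segment centres:", centres)
print("segments:", clusters)
```

Nobody told the algorithm that there are low, medium and high spenders; the three segments emerge from the distances between the points, which is the sense in which the grouping is "not yet known".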
  • 19. Example of Document Clustering
      ● Clustering points: 3,204 articles from the Los Angeles Times
      ● Similarity measure: how many words are common to two documents (after excluding some common words)
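The similarity measure described for document clustering, counting shared words after dropping common ones, can be sketched as follows. The stop-word list and the three short "documents" are assumptions made up for this illustration; they are not the Los Angeles Times articles.

```python
# Hypothetical list of common words to exclude before comparing documents
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}

def significant_words(text):
    """Return the set of words in a document, minus the common words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}

def common_word_count(doc_a, doc_b):
    """Similarity = number of significant words the two documents share."""
    return len(significant_words(doc_a) & significant_words(doc_b))

# Hypothetical documents
docs = {
    "finance_1": "the bank raised interest rates on loans",
    "finance_2": "interest rates on bank loans rise again",
    "sport_1": "the team won the final match in extra time",
}

print(common_word_count(docs["finance_1"], docs["finance_2"]))
print(common_word_count(docs["finance_1"], docs["sport_1"]))
```

Documents with a high shared-word count would end up in the same cluster; here the two finance texts are close to each other and far from the sports text.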
  • 20. Clustering of S&P Stock Data
      ● Observe stock movements every day.
      ● Clustering points: Stock-{UP/DOWN}
      ● Similarity measure: two points are more similar if the events described by them frequently happen together on the same day.
      ● We used association rules to quantify a similarity measure.
  • 21. Regression
      ● Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
          ○ Greatly studied in statistics, neural network fields.
      ● Examples:
          ○ Predicting sales amounts of new product based on advertising expenditure.
          ○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
          ○ Time series prediction of stock market indices.
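The first regression example, predicting sales from advertising expenditure, can be sketched with an ordinary least-squares straight line. The figures below are hypothetical and deliberately lie exactly on a line so that the fitted values are easy to check by hand; real data would be noisy.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(advertising, sales)

# Predict sales for an advertising spend that was not in the data
predicted_sales = intercept + slope * 6.0
print("slope:", slope, "intercept:", intercept, "prediction:", predicted_sales)
```

This is the linear case of the "model of dependency" mentioned on the slide; a nonlinear model would replace the straight line with another functional form.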
  • 22. Data Mining: Association Rules
      Mining Association Rules
      ● which products should be kept along with other products
      ● which two products should never be discounted together
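Association rules are usually scored with two standard measures, support and confidence, which the slide does not spell out. The sketch below computes both for a rule such as "bread => butter" over a handful of invented shopping baskets; the baskets and function names are assumptions for illustration.

```python
# Hypothetical shopping baskets, one set of products per transaction
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(itemset <= basket for basket in transactions)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print("support(bread, butter):", support({"bread", "butter"}))
print("confidence(bread => butter):", confidence({"bread"}, {"butter"}))
```

A rule with high support and high confidence suggests products that could be kept next to each other on the shelf.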
  • 23. Visualisation: The need to tell a story
  • 24. Visualisation: The need to tell a story
  • 25. Definitions
      Data Mining
      ● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
      ● Non-trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data
      Data Science is a rare combination of multiple skills that include
      ● Technology: obviously!
      but also
      ● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
      ● Storytelling - create a business story around the data
      ● Cleverness - again obviously, to look at the problem from different angles
  • 27. R: Your first step into Data Science
      Try out this free interactive tutorial right now
  • 28. Statistical Tools
      http://r4stats.com/articles/popularity/
  • 30. Map Reduce
      ● Input: a set of (key, value) pairs
      ● User supplies two functions
          ○ Map (k, v) => List(k1, v1)
          ○ Reduce (k1, list(v1)) => v2
      ● Output is the set of (k1, v2) pairs
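The Map and Reduce signatures on the slide can be made concrete with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop code: `run_map_reduce` is a hypothetical helper that imitates the grouping step a real framework performs between the two user-supplied functions, and the input records are invented.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for every word in a line."""
    return [(word, 1) for word in value.lower().split()]

def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts for one word."""
    return sum(values)

def run_map_reduce(records, map_fn, reduce_fn):
    """Hypothetical single-machine driver: map every record, group the
    intermediate values by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in records:
        for k1, v1 in map_fn(key, value):
            grouped[k1].append(v1)
    return {k1: reduce_fn(k1, v1_list) for k1, v1_list in grouped.items()}

# Input: a set of (key, value) pairs, here (line number, line of text)
records = [
    (1, "big data is big"),
    (2, "data science refines data"),
]

counts = run_map_reduce(records, map_fn, reduce_fn)
print(counts)
```

On a cluster, the map calls and the reduce calls would each run in parallel on different machines; only the two small functions are specific to the problem.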
  • 31. Hadoop
      A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
      ● the Map and Reduce functions
      ● loading data into HDFS
      Related tools
      1. HIVE
          a. A plug-in that allows one to use SQL-like queries that are converted into map-reduce jobs
      2. PIG
          a. A scripting language for writing long queries
      3. HBASE
          a. A non-relational DBMS
      4. SQOOP
          a. Moves data to and from HDFS
  • 33. JavaScript for Data Visualisation
  • 34. Business Domain
      ● Financial Sector
          ○ Risk Management, Credit Scoring
          ○ Predict Customer Spend
          ○ Stock and Investment Analysis
          ○ Loan approval
      ● Telecom Sector
          ○ Fraud Detection
          ○ Churn Prediction
      ● Retail and Marketing
          ○ Market segmentation
          ○ Promotional strategy
          ○ Market Basket Analysis
          ○ Trend Analysis
      ● Healthcare & Insurance
          ○ Fraud Detection
          ○ Drug Development
          ○ Medical Diagnostic Tools
  • 35. Conclusion
      ● Why data science?
      ● Techniques
          ○ Statistics
          ○ Data Mining
          ○ Visualisation
      ● Tools & Platforms
          ○ R
          ○ Hadoop / MapReduce
          ○ Real Time Systems
      ● Business Domains
      Data Science is a rare combination of multiple skills that include
      ● Technology: obviously!
      but also
      ● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
      ● Storytelling - create a business story around the data
      ● Cleverness - again obviously, to look at the problem from different angles
  • 37. Thank You
      Contact: Prithwis Mukerjee, Professor, Praxis Business School, prithwis@praxis.ac.in
      This presentation is accessible at the blog http://blog.yantrajaal.com at the following URL: http://bit.ly/pm-datascience