By
Dr. A.V.Krishna Prasad
Associate Professor, IT Dept, MVSR EC.
kpvambati@gmail.com
Data Analytics
&
Time Series Analysis
Importance of Topic
 Can you imagine a day without electricity?
 Can you imagine a day without Computer / smart
phone / mobile?
 Can you imagine a day without Internet?
 Can you imagine a day without applying the analysis /
analytics concept in any work?
 Data Science & Analytics
Data Analytics
Mathematical Statistics
Computer Science Applications
One Programming Language
Database Technologies
AI
ML
Data Mining
SPM
Linear Algebra
Probability
Statistics
Real life applications
1. Insurance
2. Banking
3. Telecom Churn
4. Social Media
5. Stock Market
6. Financial account
7. Recommendation
Systems
Analysis & Analytics
Analytics is a process of inspecting, cleaning,
transforming, and modelling big data with the
goal of discovering useful information,
suggesting conclusions, and supporting
decision making.
Data analysis is the process by which data
becomes understanding, knowledge and
insight.
Analysis – a knowledge expert / skilled person with domain knowledge
is required for decision making.
Analytics – for the naive user – automating the decision-making process.
Connection to data mining
–Analytics includes both data analysis (mining) and communication
(guiding decision making)
–Analytics is not so much concerned with individual analyses or
analysis steps, but with the entire methodology
•Analytics should act like an Extra Brain / Extra Eye / Extra ear /
Extra Sensor like Sixth sense to an Organization.
•It’s a Visualization Tool (Dashboard)
Differences between Analytics and Analysis
Data Storage Terminology
Data
Data - Numeric, Character, Integer, Real, Rational, Discrete,
Continuous, Binary, Interval Variable, Scaled, Ordinal,
Categorical…
Data – Univariate, Bivariate, Multivariate, …
Recent Data Trends - RFID Data, Web Term Data, Sensor Array
Data, Gene Expression Data, Consumer Preference Data, Symbols,
Social Media data etc.
Emoticons - Smiley, Angry
Data - are encodings that represent the qualitative or quantitative
attributes of a variable or set of variables.
Data comprises facts and statistics collected together for
reference or analysis.
Viewing the Data
Data – Object type
Array
List
Table
Matrix
Vector
Data Frame
One Dimensional
Two Dimensional
Multi-Dimensional
Multi-Dimensional Data as Three-Field Table
versus Two-Dimensional Matrix
Multi-Dimensional Data as Four-Field Table
versus Three-Dimensional Cube
BIG DATA
Big Data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and
analyze.
Big Data refers to data sets that grow so large and complex that it
is difficult to capture, store, manage, share, analyze and
visualize them with current computational architectures.
Goals: To discover new opportunities, measure
efficiencies, and uncover relationships.
Big Data
•1. The Data increases continuously
•Ex: Angry Birds – Mobile App
•(After downloading by millions of people – back-end
database – users, levels, scores, functionality, speed, etc.)
•2. Structured / Unstructured
•Example for a DBMS - Excel sheet
• RDBMS – DB2, Informix, etc.
•Data types for Facebook – emails, smileys, XML data, audio,
video
Big Data
•3. Difficult to Analyze
(Blu-ray disc – 2 GB movie – 2 hrs to watch and analyze;
1 PB -- 10 years.
Facebook generates 300 PB every month; YouTube, CCTV videos, etc.)
4. Within a certain tolerable time limit
(Facebook, YouTube, etc. usage on the Windows, macOS and UNIX platforms…
Can you visualize, within one hour, how many users have been working on the
Windows, macOS and UNIX platforms over the last 2 months?)
Types of Analytics
Data Science refers to gaining insights into
data through computation, statistics, and
visualization.
A Data Scientist is... “someone who knows
more statistics than a computer scientist and
more computer science than a statistician.”
- Josh Blumenstock
“Data Scientist = Statistician + Programmer +
Coach + Storyteller + Artist”.
- Shlomo Argamon
● Quantitative skill:
such as mathematics or statistics
● Technical aptitude:
namely, software engineering, machine learning, and programming skills
● Skeptical mind-set and critical thinking:
It is important that data scientists can examine their work
critically rather than in a one-sided way.
● Curious and creative:
Data scientists are passionate about data and finding creative ways to
solve problems and portray information.
● Communicative and collaborative:
Data scientists must be able to articulate the business value in a clear
way and collaboratively work with other groups, including project
sponsors and key stakeholders.
Data Scientist - Characteristics
Example of Big Data Analytics
After analyzing consumer purchasing behavior,
Target’s statisticians determined that the retailer
made a great deal of money from three main life-
event situations.
● Marriage, when people tend to buy many new
products
● Divorce, when people buy new products and
change their spending habits
● Pregnancy, when people have many new things to
buy and have an urgency to buy them
Why We are giving importance to
Business Analytics
• Product View: (19th Century)
Suppliers & Customers
• Managerial View: (20th Century)
Suppliers, Customers, Owners & Employees
• Business Intelligence View: (1960s to 1990s)
Suppliers, Customers, Owners, Employees,
Competitors, Government & Environmental view
• Next Generation Business Intelligence View:
(current – ANALYTICS View)
Suppliers, Customers, Owners, Employees,
Competitors, Government, Environment, Online
communities, news, media, International Partners, &
Multinational Companies.
Started with Exchange
of Goods next Goods
selling
Evolution of Business Analytics
Some common types of decisions that can be enhanced by using analytics
include
• Pricing (for example, setting prices for consumer and industrial goods,
government contracts, and maintenance contracts).
• Customer Segmentation (for example, identifying and targeting key
customer groups in retail, insurance, and credit card industries).
ADVANTAGES:
 BA increases profitability and shareholder returns
 BA enhances understanding of data
 BA is vital for businesses to remain competitive
 BA enables creation of informative reports
Four Types of Data Based on Measurement Scale:
 Categorical (nominal) data
 Ordinal data
 Interval data
 Ratio data
Data Availability
Example 1.3
Classifying Data Elements in a Purchasing Database
Data for Business Analytics
Figure 1.2
Categorical (nominal) Data
 Data placed in categories according to a specified
characteristic
 Categories bear no quantitative relationship to one another
 Examples:
- customer’s location (America, Europe, Asia)
- employee classification (manager, supervisor,
associate)
Data for Business Analytics
Ordinal Data
 Data that is ranked or ordered according to some
relationship with one another
 No fixed units of measurement
 Examples:
- college football rankings
- survey responses
(poor, average, good, very good, excellent)
Data for Business Analytics
Interval Data
 Ordinal data but with constant differences between
observations
 No true zero point
 Ratios are not meaningful
 Examples:
- temperature readings
- SAT scores
Data for Business Analytics
Ratio Data
 Continuous values and have a natural zero point
 Ratios are meaningful
 Examples:
- monthly sales
- delivery times
Data for Business Analytics
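The interval-versus-ratio distinction above can be made concrete with a small Python sketch (the temperature figures are purely illustrative): on an interval scale such as Celsius, ratios are not meaningful because there is no true zero, while on a ratio scale such as Kelvin they are.

```python
# Interval vs. ratio scales: a hypothetical temperature example.
# Celsius is an interval scale (no true zero), so ratios are meaningless;
# Kelvin is a ratio scale with a natural zero, so ratios are meaningful.

def c_to_k(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

a_c, b_c = 10.0, 20.0                    # two readings in Celsius
naive_ratio = b_c / a_c                  # 2.0 -- but 20 C is NOT "twice as hot" as 10 C
true_ratio = c_to_k(b_c) / c_to_k(a_c)   # ~1.035 on the ratio (Kelvin) scale

print(naive_ratio)           # 2.0
print(round(true_ratio, 3))  # 1.035
```

The same reasoning explains why monthly sales (ratio data) support statements like "twice as much", but SAT scores (interval data) do not.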
Variable Types – Identify the correct type(s):

Variable         Scoring                                                   Nominal   Ordinal   Continuous
Quality of life  1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good
Ethnicity        1 = Non-Hispanic, 2 = Hispanic
Race             1 = African American, 2 = Caucasian, 3 = Other
Diabetes         1 = Absent, 2 = Present
Systolic BP      Ranges from 95 to 190 mmHg
Variable Types – Identify the correct type(s):

Variable         Scoring                                                   Nominal   Ordinal   Continuous
Quality of life  1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good            ●
Ethnicity        1 = Tribal, 2 = Religious                                 ●
Race             1 = Red Indian, 2 = Black Indian, 3 = Other               ●
Diabetes         1 = Absent, 2 = Present                                   ●
Systolic BP      Ranges from 95 to 190 mmHg                                                    ●
Web References
 https://cran.r-project.org/ (For Software –R)
 https://www.rstudio.com/ (For R studio – GUI)
 http://www.r-tutor.com/ (For Basic Learners)
 https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial
(For Scripts)
 https://www.analyticsvidhya.com/
(For Advanced level – Mining, Machine Learning..)
General Programming Components
Program Structure, Syntax, Semantics, Execution part
Comments, Data Storage Elements, Data Types,
I/O Statements, Control Statements, Arrays,
Structures, Functions, Files etc.
R Packages
Specialized functions, its Usage
and its Importance
Implementation in R
R Programming PPT
Descriptive statistics
Predictive statistics
Visualization
DATA VISUALIZATION IN R
Basic Visualization
•Histogram
•Bar / Line Chart
•Box plot
•Scatter plot
Advanced Visualization
Heat Map
Mosaic Map
Map Visualization
3D Graphs
Correlogram
demo(graphics) – for a graphics demo in the R tool
Libraries – ggplot2, RColorBrewer; Package: HistData
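The four basic chart types listed above can be sketched in Python with matplotlib (the deck demonstrates the same charts in R via plot(), hist(), boxplot(), etc.). The data set is made up, and the sketch assumes matplotlib is installed.

```python
# Basic visualization sketch: histogram, bar chart, box plot, scatter plot.
import os
import tempfile

import matplotlib
matplotlib.use("Agg")                    # headless backend: render to a file
import matplotlib.pyplot as plt

data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 9]   # illustrative sample
x = list(range(len(data)))

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(data, bins=5)            # histogram
axes[0, 1].bar(x, data)                  # bar chart
axes[1, 0].boxplot(data)                 # box plot
axes[1, 1].scatter(x, data)              # scatter plot

out_path = os.path.join(tempfile.gettempdir(), "basic_plots.png")
fig.savefig(out_path)                    # write the 2x2 panel to disk
print(out_path)
```

The heat maps, mosaic plots and 3D graphs listed under advanced visualization follow the same pattern with other plotting calls or packages.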
Five Steps of a Machine Learning Project Lifecycle
 1. Obtain Data
 2. Scrub Data
 3. Explore Data
 4. Model Data
 5. Interpret Data
Project Development Life Cycle
Five Steps of a Machine Learning Project Lifecycle
 1. Obtain Data – Data Collection, Sources
 2. Scrub Data - Data Cleaning
 3. Explore Data – Descriptive Statistics
 4. Model Data - Model Fitting
 5. Interpret Data - Results
Project Development Life Cycle
Project Life Cycle
Project Development Life Cycle
1. Obtain Data
Data Collection:
Querying from databases, flat files, Excel data
Unstructured – NoSQL, MongoDB
Social media data – Facebook, WhatsApp, Twitter
Website data – web scraping, Beautiful Soup,
Web APIs, IoT sensor data
Data sets from websites like Kaggle, KDnuggets, ISRO, NRSC,
Bhuvan…
Formats – CSV, TSV, special sparse formats
Project Development Life Cycle
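As a minimal sketch of the "Obtain Data" step, here is a flat-file read with Python's standard csv module (the sales figures are invented; a real project would query a database, a web API or a downloaded data set instead):

```python
# Step 1 in miniature: parse a (hypothetical) CSV flat file.
import csv
import io

raw = """date,sales
2024-01,120
2024-02,135
2024-03,150
"""

rows = list(csv.DictReader(io.StringIO(raw)))  # header row -> dict per record
sales = [int(r["sales"]) for r in rows]        # convert text fields to numbers
print(len(rows), sales)                        # 3 [120, 135, 150]
```

In practice `io.StringIO(raw)` would be replaced by `open("sales.csv")`, and TSV files only change the `delimiter` argument.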
Sensors which generate data
Project Development Life Cycle
Inputs are digitized and placed onto the network
Project Development Life Cycle
Project Development Life Cycle
2. Scrub Data (Data Cleaning, Pre-processing)
Text Cleaning Techniques
•Make all text lower case
•Remove punctuation
•Remove stop words
•Remove URLs
•Remove HTML tags
•Remove emojis and emoticons
Garbage in –> Garbage out;
No quality data -> No quality results
Data Consolidation, Data Cleaning, Handling Missing Values, Outliers
Dimensionality Reduction, Smoothing, Normalization
Image Pre-processing
•Noise Removal
•Image Filtering
Project Development Life Cycle
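The text-cleaning steps listed above can be sketched as one function using only the Python standard library. The stop-word list, the `clean_text` helper and the sample sentence are all made up for illustration; real pipelines usually take stop words from a package such as NLTK.

```python
# Scrub Data sketch: lower-case, strip URLs, HTML tags, emojis,
# punctuation and stop words, in that order.
import re
import string

STOP_WORDS = {"a", "an", "the", "is", "at", "on", "and"}   # tiny illustrative list

def clean_text(text):
    text = text.lower()                               # 1. lower case
    text = re.sub(r"https?://\S+", "", text)          # 2. remove URLs
    text = re.sub(r"<[^>]+>", "", text)               # 3. remove HTML tags
    text = re.sub(r"[^\x00-\x7F]+", "", text)         # 4. strip emojis (non-ASCII)
    text = text.translate(str.maketrans("", "", string.punctuation))  # 5. punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]          # 6. stop words
    return " ".join(words)

print(clean_text("Check <b>THIS</b> out at https://example.com! 🙂 It is great."))
# -> "check this out it great"
```

Garbage in, garbage out: every downstream step (word clouds, sentiment, topic models) inherits whatever this step leaves behind.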
Feature / variable creation is the process of generating new
variables / features from existing variable(s).
Handling Missing Data
• Deletion
  – Deleting rows (listwise deletion)
  – Pairwise deletion
  – Deleting columns
• Imputation
  – General problem
    Categorical: make NA a level, multiple imputation, logistic regression
    Continuous: mean, median, mode, multiple imputation, linear regression
  – Time-series problem
    Data without trend & without seasonality: mean, median, mode, random sample imputation
    Data with trend & without seasonality: linear interpolation
    Data with trend & with seasonality: seasonal adjustment + interpolation
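Two of the imputation strategies above can be sketched in plain Python: mean imputation for a general continuous variable, and linear interpolation for time-series data with trend but no seasonality. `None` marks a missing value; both helper names and the data are invented for illustration.

```python
# Missing-value handling sketch.

def impute_mean(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def interpolate_linear(values):
    """Fill interior gaps by a straight line between observed neighbours."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            lo, hi = i - 1, i + 1
            while out[hi] is None:            # find the next observed value
                hi += 1
            step = (out[hi] - out[lo]) / (hi - lo)
            out[i] = out[lo] + step * (i - lo)
    return out

print(impute_mean([1, None, 3, None, 5]))        # [1, 3.0, 3, 3.0, 5]
print(interpolate_linear([10, None, None, 16]))  # [10, 12.0, 14.0, 16]
```

Mean imputation ignores any trend, which is exactly why the tree above reserves it for data without trend and seasonality.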
Feature/ Variable creation
Rank of matrix –
number of feature attributes
Project Development Life Cycle
3. Explore Data
Inspect the data and its properties.
Different data types like numerical data, categorical data,
ordinal and nominal data etc. require different treatments.
Descriptive Analysis, Understanding of Data
Apply Descriptive Statistics to understand the data.
Significance Test,
Distributions, Inferential Statistics
Project Development Life Cycle
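The descriptive-statistics step can be sketched with Python's standard `statistics` module (the R equivalents are mean(), median(), sd(), etc.); the sample values here are made up.

```python
# Explore Data sketch: central tendency and spread of a small sample.
import statistics as st

sample = [12, 15, 11, 18, 22, 15, 19, 15]

print(st.mean(sample))                 # 15.875 -- central tendency
print(st.median(sample))               # 15.0
print(st.mode(sample))                 # 15
print(round(st.stdev(sample), 2))      # sample standard deviation (spread)
```

These summaries, together with the distribution checks and significance tests mentioned above, decide which treatment each variable needs before modelling.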
Types of Analytics
Project Development Life Cycle
Project Development Life Cycle
4. Model Data
Project Development Life Cycle
 Predictive Decision Models often incorporate
uncertainty to help managers analyze risk.
 Aim to predict what will happen in the future.
 Uncertainty is imperfect knowledge of what will
happen in the future.
 Risk is associated with the consequences of
what actually happens.
Decision Models
Project Development Life Cycle
Machine Learning Types
Supervised Learning
• Continuous target variable → Regression
• Categorical target variable → Classification
Unsupervised Learning
• Target variable not available → Clustering, Association
Semi-supervised Learning
• Categorical target variable → Classification, Clustering
Machine Learning Algorithms…
Project Development Life Cycle
Choosing an algorithm by variable type:
• Predictor continuous, response continuous: Linear regression, Neural
network, K-nearest Neighbor (KNN)
• Predictor continuous, response categorical: Logistic regression, KNN,
Neural network
• Predictor continuous, no response variable: Cluster analysis, Principal
Component Analysis
• Predictor categorical, response continuous: Linear regression, Neural
network
• Predictor categorical, response categorical: Decision/classification
trees, Logistic regression, Naïve Bayes
• Predictor categorical, no response variable: Association rules
Project Development Life Cycle
PREDICTION TECHNIQUES
Regression models
Regression Analysis attempts to explain the influence that a set of variables
has on the outcome of another variable of interest. The outcome variable is
called the dependent variable. The additional variables are called
independent or input variables.
Linear Regression Model –
(least squares: minimize the sum of squared errors)
Ex: What is the person's expected income?
Logistic Regression Model – Maximum Likelihood
Ex: What is the probability that an applicant will default on a loan?
Project Development Life Cycle
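The least-squares idea behind linear regression can be sketched from first principles in plain Python: pick the intercept and slope that minimize the sum of squared errors. The `fit_line` helper and the income figures are invented for illustration, not a library call.

```python
# Ordinary least squares for one predictor, via the closed-form solution.

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing sum((y - (a + b*x))**2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

years = [1, 2, 3, 4, 5]                  # years of experience
income = [30, 35, 40, 45, 50]            # expected income (thousands, made up)
a, b = fit_line(years, income)
print(a, b)                              # 25.0 5.0 -- income = 25 + 5 * years
```

Logistic regression replaces this closed form with maximum-likelihood estimation, which has no closed-form solution and is fitted iteratively.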
PREDICTION TECHNIQUES
Regression models
ARMA – Autoregressive Moving Average models
ARIMA – Autoregressive Integrated Moving Average models
Holt-Winters
Fuzzy Logic
Neural Networks
Genetic Algorithms
LASSO – Least Absolute Shrinkage and Selection Operator
Project Development Life Cycle
DEEP LEARNING
Deep learning is an artificial intelligence function that imitates the
workings of the human brain in processing data and creating
patterns for use in decision making. Deep learning is a subset
of machine learning in artificial intelligence (AI) that has
networks capable of learning unsupervised from data that is
unstructured or unlabeled. Also known as deep neural learning or
deep neural network.
1. ANN
2. CNN
3. RNN
5. Interpreting Data
Interpreting data refers to presenting your findings to a non-technical
audience.
Actionable insight is a key outcome from a data science project.
Need strong business domain knowledge to present your findings in a way
that can answer the business questions you set out to answer, and translate
them into actionable steps.
Data Visualization – R – ggplot2
Python – matplotlib, ggplot, seaborn
Project Development Life Cycle
INSIGHTS OF DATA SCIENCE
Insight is the value obtained through the use of analytics. The
insights gained through analytics are incredibly powerful and can be
used to grow your business while identifying areas of opportunity.
Insight is what is learned and what will improve your business.
The True Power of Insights-Driven Marketing:
The real value of data and analytics lies in their ability to deliver rich
insights.
The best insights are actionable and prescriptive - they can be used to
take immediate action that will improve your business and will
inform your future path.
Project Development Life Cycle
INSIGHTS OF DATA SCIENCE
2020 – 2050 – a Data-Centric World.
Statisticians and Data Scientists will rule the world.
All real-world projects are related to data. Several data science
projects are coming.
Industry is recruiting data-science-related personnel as Statistician,
Business Analyst, Data Analyst, Data Engineer, Data Architect,
Information Architect, Data Scientist, etc.
Data Science has started as a new branch in B.Tech (Data Science),
MS/M.Tech (Data Science) and PhD (Data Science) programmes.
Data Science has evolved as a separate field in science; industry and
academia have accepted it and started it as a new branch.
Project Development Life Cycle
Sample Works
1. Forecasting on $ Rate
2. Forecasting on Ionospheric data
(R Commands for plotting, Coloring, Graph
Labeling)
3. Weather Forecasting
Social Media Analytics
1. Twitter Analytics
2. Facebook Analytics
3. Google Analytics
4. Web Analytics
Tools
Tracking and reporting social media analytics used to be a
hurdle for digital marketers – now the problem is finding
the ideal tool.
Twitter Analytics using R
Public handles
#namo, #kejriwal, #ipl, #climate
Packages used :
twitteR – provides access to Twitter data
tm – provides functions for text processing
wordcloud – visualizes the results as a word cloud
1. Extract Tweets from Twitter.
2. Text Processing
3. Apply Analytical Methods
4. Visualization
https://developer.twitter.com/
Consumer Key and Consumer Secret number , Access Token
Authentication key and number
Text Processing:
R packages used in Social Media Lab (planned)
• twitteR (for collecting Twitter data)
• tm (text mining)
• wordcloud (text word clouds)
• RTextTools (machine learning package for automatic text
classification)
• igraph (network analysis and visualization)
• RCurl (collecting WWW data)
• XML (reading and creating XML documents)
• R.utils (programming utilities)
• ape and dendextend (dendrograms, hierarchical clustering)
• FactoMineR and homals (multiple correspondence analysis)
• plyr and stringr (text sentiment analysis)
Twitter Analytics using R
IMPLEMENTATION
in R
Topic modeling is a type of statistical modeling for
discovering the abstract “topics” that occur in a collection of
documents.
Latent Dirichlet Allocation (LDA) is an example of a topic
model and is used to classify the text in a document to a
particular topic.
It builds a topic per document model and words per topic
model, modeled as Dirichlet distributions.
Facebook:
https://developers.facebook.com/
https://analytics.facebook.com/
https://business.facebook.com/
Twitter
https://developer.twitter.com/
https://analytics.twitter.com/
https://business.twitter.com/
Time Series Analysis
Time series analysis plays a major role in business analytics.
Time series data can be defined as the quantities that trace the values
taken by a variable over a period such as a month, quarter or year.
For example, in the share market, the price of shares changes every
second.
Another example of time series data is measuring the level of
unemployment each month of the year.
Time Series Analysis
Univariate and multivariate are two types of time series data.
When time series data uses a single quantity to describe values, it
is termed univariate.
When time series data uses more than a single quantity to
describe values, it is called multivariate.
R language provides many commands for Data Visualization such
as plot(), hist(), pie(), boxplot(), stripchart(), curve(), abline(),
qqnorm(), etc.
plot() and hist() commands are mostly used in time series analysis.
Time Series Analysis
Linear Filtering of Time Series :
A simple component analysis divides the data into four main
components called trend, seasonal, cyclical and irregular.
Each component has its own special feature.
For example, the trend, seasonal, cyclical and irregular components
define the long-term progression, the seasonal variation, the
repeated but non-periodic fluctuations, and the random or irregular
components of any time series, respectively.
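The moving-average linear filter that time series analysis mostly uses can be sketched in plain Python (the `moving_average` helper mirrors what R's filter()/SMA() provide; window size and data are illustrative): smoothing the series damps the irregular component and exposes the trend.

```python
# Linear filtering sketch: simple moving average over a sliding window.

def moving_average(series, window):
    """Average each run of `window` consecutive points."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

series = [3, 5, 4, 6, 5, 7, 6, 8]        # upward trend + irregular wiggle
print(moving_average(series, 3))
# -> [4.0, 5.0, 5.0, 6.0, 6.0, 7.0]  (the underlying upward trend)
```

A larger window smooths more aggressively; for seasonal data the window is usually set to the length of the seasonal cycle.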
Time Series Analysis
Decomposing time series Data :
Decomposing time series data is also a part of the simple component
analysis that defines four components, viz., trend, seasonal,
cyclical and irregular.
Forecasts Using exponential smoothing :
Forecasts are a type of prediction that predict future events from past
data. Here, the forecast process uses exponential smoothing for
making predictions.
An exponential smoothing method finds out the changes in time series
data by ignoring the irrelevant fluctuations and makes the short-term
forecast prediction for time series data.
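Simple exponential smoothing as described above can be sketched in a few lines of Python: each smoothed value is a weighted blend of the newest observation and the previous smoothed value, which damps the irrelevant fluctuations. The `ses` helper and the data are illustrative, and alpha is fixed by hand here, whereas R's HoltWinters() would estimate it from the data.

```python
# Simple exponential smoothing sketch.

def ses(series, alpha):
    """Return the smoothed (level) series; the last value is the forecast."""
    level = series[0]                     # initialise with the first observation
    out = [level]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level   # blend new data with old level
        out.append(level)
    return out

smoothed = ses([10, 12, 11, 13, 12], alpha=0.5)
print(smoothed)   # [10, 11.0, 11.0, 12.0, 12.0]
```

With alpha near 1 the smoother tracks every fluctuation; with alpha near 0 it barely moves, so the choice of alpha is the whole trade-off.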
Time Series Analysis
1. What do you mean by exponential smoothing?
Ans: Exponential smoothing method finds out the changes in time series
data by ignoring the irrelevant fluctuations and makes the short-term
forecast prediction for time series data.
2. What is the HoltWinters() function?
Ans: The HoltWinters() function is an inbuilt function, commonly used
for finding exponential smoothing. All three types of exponential
smoothing use the HoltWinters() function but with different parameters.
The HoltWinters() function returns a value between 0 and 1 for all three
parameters, viz., alpha, beta and gamma.
3. What is the function of Holt’s exponential smoothing?
Ans: Holt’s exponential smoothing estimates the level and slope at the
current time point.
The alpha and beta parameters of the HoltWinters() function control it
and estimates the level and slope, respectively.
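Holt's method from Q3 can be sketched in plain Python: it tracks both a level and a slope, matching the alpha/beta description above. The `holt_forecast` helper, the smoothing constants and the data are illustrative; HoltWinters() would estimate alpha and beta rather than take them as fixed inputs.

```python
# Holt's (double) exponential smoothing sketch: level + slope.

def holt_forecast(series, alpha, beta, h):
    """Return the h-step-ahead forecast from Holt's linear method."""
    level = series[0]
    slope = series[1] - series[0]         # initial slope from the first two points
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + slope)   # update level
        slope = beta * (level - prev_level) + (1 - beta) * slope  # update slope
    return level + h * slope              # extrapolate the trend h steps ahead

print(holt_forecast([10, 12, 14, 16, 18], alpha=0.5, beta=0.5, h=1))
# -> 20.0 (the forecast continues the linear trend)
```

Adding a third, gamma-weighted seasonal equation to this pair gives the full Holt-Winters method.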
Time Series Analysis - Summary
Time series data is a type of data that is stored at regular intervals or
follows the concept of time series.
Statistically, a time series is a sequence of
numerical data points in successive order.
• A multivariate time series is a type of time series that uses more
than a single quantity for describing the values.
• The plot(), hist(), boxplot(), pie(), abline(), qqnorm(), stripchart()
and curve() functions are some R commands used for visualisation of
time series data.
• The mean(), sd(), log(), diff(), pnorm() and qnorm() are some R
commands used for manipulation of time series data.
Time Series Analysis - Summary
• The simple component analysis divides the data into four main
components named trend, seasonal, cyclical and irregular. The
linear filter is a part of simple component analysis.
• Linear filtering of time series uses linear filters for generating the
different components of the time series. Time series analysis
mostly uses a moving average as a linear filter.
• A filter() function performs linear filtering of time series data and
generates the time series of the given data.
• A scan() function reads the data from any file. Since time series
data contains data with respect to a successive time interval, it is
the best function for reading it.
Time Series Analysis - Summary
• A ts() function stores time series data and creates a time
series object.
• The as.ts() function converts a simple object into time
series object and the is.ts() function checks whether an
object is a time series object or not.
• The plotting task represents time series data graphically
during time series analysis and R provides the plot()
function for plotting of time series data.
• A non-seasonal time series contains the trend and irregular
components; hence, the decomposition process converts
the non-seasonal data into these components.
Time Series Analysis - Summary
• The SMA() function is used for decomposition of non-
seasonal time series. It smooths time series data by
calculating the moving average and estimates the trend and
irregular components.
• The function is available in the package “TTR”.
• A seasonal time series contains the seasonal, trend and
irregular component; hence, the decomposition process
converts the seasonal data into these three components.
• The seas() function automatically finds out the seasonally
adjusting series. The function is available in the “seasonal”
package.
Time Series Analysis - Summary
• Regression analysis defines the linear relationship between
independent variables (predictors) and dependent (response)
variables using a linear function.
• Forecasts are a type of prediction that predicts the future events
from the past data.
• Simple exponential smoothing estimates the level at the current
time point and performs the short-term forecast. The alpha
parameter of the HoltWinters() function controls the simple
exponential smoothing.
• Holt’s exponential smoothing estimates the level and slope at the
current time point. The alpha and beta parameters of the
HoltWinters() function controls it and estimates the level and
slope, respectively.
Time Series Analysis - Summary
• Holt-Winters exponential smoothing estimates the level, slope
and seasonal component at the current time point. The
alpha, beta and gamma parameters of the HoltWinters() function
control it and estimate the level, slope of the trend component and
seasonal component, respectively.
• ARIMA (Autoregressive Integrated Moving Average) is
another method of time series forecasting.
• The ARIMA model explicitly defines the irregular component of
a stationary time series with non-zero autocorrelation. It
is represented by ARIMA(p, d, q), where the parameters p, d and q
define the autoregression (AR) order, the degree of differencing
and the moving average (MA) order, respectively.
Time Series Analysis - Summary
• The diff() function differences a time series to find an
appropriate ARIMA model. It helps to obtain a stationary time
series for an ARIMA model and also finds the value of the 'd'
parameter of an ARIMA(p, d, q) model.
• An auto.arima() function automatically returns the best candidate
ARIMA(p, d, q) model for the given time series.
• An arima() function finds out the parameters of the selected
ARIMA(p, d, q) model.
• The arima.sim() function simulates a time series from an
ARIMA(p, d, q) model, from which the autocorrelation and
partial autocorrelation can then be examined.
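What R's diff() does can be sketched in plain Python (the trending series and the `diff` helper are illustrative): first differencing removes a linear trend, and differencing until the series looks stationary is how the 'd' in ARIMA(p, d, q) is chosen.

```python
# Differencing sketch: x[t] - x[t - lag], mirroring R's diff().

def diff(series, lag=1):
    """Return the lag-differenced series."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [5, 8, 11, 14, 17, 20]           # linear trend: not stationary
print(diff(trend))                       # [3, 3, 3, 3, 3] -- constant, so d = 1 suffices
```

A series with a quadratic trend would need a second pass of differencing, i.e. d = 2.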
 IBM Watson Analytical Tool
 Developer.twitter.com
 Developer.google.com
 Developer.ibm
 API
 Amazon Web services
Even though ChatGPT…
SELF ANALYTICS
• Stay Home Stay Safe
• Do Yoga, Stay Healthy ( Physical Fitness)
( Hand Movements, Neck Exercises, Eye Exercise, Surya Namaskar..)
• Do Breathing Exercises – Improves your Concentration power
• Maintain Social distance with electronic devices at least 2 hrs per day
( 1 hour before going to bed and 1 hour after coming from bed)
After using mobile, Computer – Practice Gandhari exercise.
• Practice Meditation – Move towards spirituality.
The purpose of LIFE is
“Awakening within You and Awakening
the Others”
“Life Long Loyal Love for Learning”
Kp-Data Analytics-ts.pptx

Kp-Data Analytics-ts.pptx

  • 1.
    By Dr. A.V.Krishna Prasad AssociateProfessor, IT Dept, MVSR EC. kpvambati@gmail.com Data Analytics & Time Series Analysis
  • 2.
    Importance of Topic Can you imagine a day without electricity?  Can you imagine a day without Computer / smart phone / mobile?  Can you imagine a day without Internet?  Can your imagine a day without doing Analysis / Analytics concept in any work?  Data Science & Analytics
  • 3.
    Data Analytics MathematicalStatistics Computer ScienceApplications One Programming Language Data Base Technologies AI ML Data Mining SPM Linear Algebra Probability Statistics Real life applications 1. Insurance 2. Banking 3. Telecom Churn 4. Social Media 5. Stock Market 6. Financial account 7. Recommendation Systems
  • 4.
    Analysis & Analytics Analyticsis a process of inspecting, cleaning, transforming, and modelling big data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis is the process by which data becomes understanding, knowledge and insight.
  • 5.
    Analysis – Knowledgeexpert, skilled person, domain knowledge required to do decision making. Analytics – Naive user – Automating the decision making process. Connection to data mining –Analytics include both data analysis (mining) and communication (guide to decision making) –Analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology •Analytics should act like an Extra Brain / Extra Eye / Extra ear / Extra Sensor like Sixth sense to an Organization. •It’s an Visualization Tool (Dashboard) Differences b/n Analytics and Analysis
  • 6.
  • 7.
    Data Data - Numeric,Character, Integer, Real, Rational, Discrete, continuous, Binary, Interval Variable, Scaled , ordinal, rational Catagorical …. Data – Univariate, Bivariate, Multi Variate , -- Recent Data Trends - RFID Data, Web Term Data, Sensor Array Data, Gene Expression Data, Consumer Preference Data, Symbols, Social Media data etc. Emoticons - Smiley, Angry Data - are encodings that represent the qualitative or quantitative attributes of a variable or set of variables. Data is comprised of facts and statistics collected together for reference or analysis.
  • 8.
    Viewing the Data Data– Object type Array List Table Matrix Vector Data Frame One Dimensional Two Dimensional Multi-Dimensional
  • 9.
    Multi-Dimensional Data asThree-Field Table versus Two-Dimensional Matrix
  • 10.
    Multi-Dimensional Data asThree-Field Table versus Two-Dimensional Matrix
  • 11.
    Multi-Dimensional Data asFour-Field Table versus Three-Dimensional Cube
  • 12.
    Multi-Dimensional Data asFour-Field Table versus Three-Dimensional Cube
  • 13.
    BIG DATA Big Datarefers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze. Big Data refers to data sets grow so large and complex that it is difficult to capture, store, manage, share, analyze and visualize with current computational architecture. Goals: To discover new opportunities, measure efficiencies uncover relationships
  • 14.
    Big Data •1. TheData increases continuously •Ex: Angry Birds – Mobile App •( After downloading from millions of people – back end database – users, levels, scores, functionality, speed etc.) •2.Structure / Unstructure •Example for Dbms - Excel sheet • RDBMS – DB2, Informix etc •Data type for faceboook – emails, smileys, xml data, audio video
  • 15.
    Big Data •3. Difficultto Analyze (Blue Ray disc – 2 GB Movie - 2 hrs time to watch and analyze 1 PB -- 10 years Facebook – Generates 300 PB every month. Youtube – CCTV – videos etc) 4. With in certain tolerable time limit (Facebook, Youtube etc usage on windows, MacOS, unix platform.. Can you visualize how many users are working on windows, MacOS, unix platform from last 2 months in one hour )
  • 16.
  • 17.
    Data Science refersto gain insights into data through computation, statistics, and visualization. A Data Scientist Is... someone who knows more statistics than a computer scientist and more computer science than a statistician.” - Josh Blumenstock “Data Scientist = Statistician + Programmer + Coach + Storyteller + Artist”. - Shlomo Aragmon
  • 18.
    ● Quantitative skill: suchas mathematics or statistics ● Technical aptitude: namely, software engineering, machine learning, and programming skills ● Skeptical mind-set and critical thinking: It is important that data scientists can examine their work critically rather than in a one-sided way. ● Curious and creative: Data scientists are passionate about data and finding creative ways to solve problems and portray information. ● Communicative and collaborative: Data scientists must be able to articulate the business value in a clear way and collaboratively work with other groups, including project sponsors and key stakeholders. Data Scientist - Characteristics
  • 19.
Example of Big Data Analytics Problem
After analyzing consumer purchasing behavior, Target's statisticians determined that the retailer made a great deal of money from three main life-event situations:
● Marriage, when people tend to buy many new products
● Divorce, when people buy new products and change their spending habits
● Pregnancy, when people have many new things to buy and have an urgency to buy them
  • 21.
Why We Are Giving Importance to Business Analytics
  • 22.
Evolution of Business Analytics – started with exchange of goods, then goods selling.
• Product View (19th Century): Suppliers & Customers
• Managerial View (20th Century): Suppliers, Customers, Owners & Employees
• Business Intelligence View (1960s to 1990s): Suppliers, Customers, Owners, Employees, Competitors, Government & Environmental view
• Next-Generation Business Intelligence View (current – ANALYTICS view): Suppliers, Customers, Owners, Employees, Competitors, Government, Environment, Online communities, news, media, International partners & Multinational companies
  • 23.
Some common types of decisions that can be enhanced by using analytics include:
• Pricing (for example, setting prices for consumer and industrial goods, government contracts, and maintenance contracts)
• Customer segmentation (for example, identifying and targeting key customer groups in the retail, insurance, and credit card industries)
ADVANTAGES:
 BA increases profitability and shareholder returns
 BA enhances understanding of data
 BA is vital for businesses to remain competitive
 BA enables creation of informative reports
  • 24.
Four Types of Data Based on Measurement Scale:  Categorical (nominal) data  Ordinal data  Interval data  Ratio data — Data Availability
  • 25.
Example 1.3: Classifying Data Elements in a Purchasing Database – Data for Business Analytics, Figure 1.2
  • 26.
Example 1.3 (continued): Classifying Data Elements in a Purchasing Database – Data for Business Analytics, Figure 1.2
  • 27.
Categorical (nominal) Data  Data placed in categories according to a specified characteristic  Categories bear no quantitative relationship to one another  Examples: customer's location (America, Europe, Asia); employee classification (manager, supervisor, associate)
  • 28.
Ordinal Data  Data that is ranked or ordered according to some relationship with one another  No fixed units of measurement  Examples: college football rankings; survey responses (poor, average, good, very good, excellent)
  • 29.
Interval Data  Ordinal data but with constant differences between observations  No true zero point  Ratios are not meaningful  Examples: temperature readings; SAT scores
  • 30.
Ratio Data  Continuous values with a natural zero point  Ratios are meaningful  Examples: monthly sales; delivery times
  • 31.
Variable Types – Identify the correct type(s) (Nominal / Ordinal / Continuous):
• Quality of life: 1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good
• Ethnicity: 1 = Non-Hispanic, 2 = Hispanic
• Race: 1 = African American, 2 = Caucasian, 3 = Other
• Diabetes: 1 = Absent, 2 = Present
• Systolic BP: ranges from 95 to 190 mmHg
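The nominal/ordinal/continuous distinction above maps directly onto R's data types; a minimal sketch (the variable names mirror the exercise above):

```r
# Nominal: categories with no inherent order
ethnicity <- factor(c("Non-Hispanic", "Hispanic", "Non-Hispanic"))

# Ordinal: ordered categories (quality-of-life scale)
qol <- factor(c("Poor", "Good", "Average"),
              levels = c("Poor", "Fair", "Average", "Good", "Very Good"),
              ordered = TRUE)

# Continuous: systolic blood pressure in mmHg
sbp <- c(120, 135, 95)

is.ordered(qol)     # TRUE: rank comparisons are allowed for ordinal data
qol[2] > qol[1]     # TRUE: "Good" ranks above "Poor"
mean(sbp)           # arithmetic is meaningful only for numeric data
```

Declaring a variable as an ordered factor tells R which summaries and models are legitimate for it, which is exactly the point of the measurement-scale classification.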
  • 32.
Variable Types – Identify the correct type(s) (Nominal / Ordinal / Continuous):
● Quality of life: 1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good
● Ethnicity: 1 = Tribal, 2 = Religious
● Race: 1 = Red Indian, 2 = Block Indian, 3 = Other
● Diabetes: 1 = Absent, 2 = Present
● Systolic BP: ranges from 95 to 190 mmHg
  • 34.
  • 35.
Web References
 https://cran.r-project.org/ (For Software – R)
 https://www.rstudio.com/ (For RStudio – GUI)
 http://www.r-tutor.com/ (For Basic Learners)
 https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial (For Scripts)
 https://www.analyticsvidhya.com/ (For Advanced level – Mining, Machine Learning...)
  • 36.
General Programming Components – Program structure, syntax, semantics, execution; comments, data storage elements, data types, I/O statements, control statements, arrays, structures, functions, files etc.
R Packages – specialized functions, their usage and importance
  • 37.
Implementation in R – R Programming PPT: Descriptive statistics, Predictive statistics, Visualization
  • 38.
DATA VISUALIZATION IN R
Basic Visualization: Histogram, Bar / Line Chart, Box plot, Scatter plot
Advanced Visualization: Heat Map, Mosaic Map, Map Visualization, 3D Graphs, Correlogram
demo(graphics) – for a demo in the R tool
Libraries – ggplot2, RColorBrewer; Package: HistData
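Each basic chart type listed above is a one-line call in base R; a minimal sketch on simulated data (ggplot2 offers the same charts with a layered grammar):

```r
set.seed(1)
x <- rnorm(100)                        # 100 simulated values for demonstration
y <- 2 * x + rnorm(100, sd = 0.5)      # a variable linearly related to x

hist(x, main = "Histogram")            # distribution of one variable
barplot(table(cut(x, 4)))              # bar chart of binned counts
boxplot(x, main = "Box plot")          # median, quartiles, outliers
plot(x, y, main = "Scatter plot")      # relationship between two variables
lines(lowess(x, y), col = "red")       # smoothed trend line over the scatter
```

In a script these plots go to the active graphics device (e.g. Rplots.pdf); in RStudio they appear in the Plots pane.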
  • 39.
Five Steps of a Machine Learning Project Lifecycle  1. Obtain Data  2. Scrub Data  3. Explore Data  4. Model Data  5. Interpret Data — Project Development Life Cycle
  • 40.
Five Steps of a Machine Learning Project Lifecycle  1. Obtain Data – data collection, sources  2. Scrub Data – data cleaning  3. Explore Data – descriptive statistics  4. Model Data – model fitting  5. Interpret Data – results
  • 41.
Project Life Cycle – Project Development Life Cycle
  • 42.
1. Obtain Data
Data collection: querying from a database, flat files, Excel data
Unstructured – NoSQL, MongoDB
Social media data – Facebook, WhatsApp, Twitter
Website data – web scraping, Beautiful Soup, Web APIs, IoT sensor data
Data sets from websites like Kaggle, KDnuggets, ISRO, NRSC, Bhuvan...
Formats – CSV, TSV, special parser formats
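Reading the flat-file formats listed above is a single call in base R; a minimal sketch using an inline CSV (in practice the text would come from a downloaded file, e.g. a Kaggle data set):

```r
# A small inline CSV standing in for a real file
csv_text <- "id,region,sales
1,Asia,120
2,Europe,95
3,America,143"

sales <- read.csv(text = csv_text)  # read.csv("file.csv") for a file on disk;
                                    # read.delim() handles TSV
str(sales)                          # always inspect column names and types first
nrow(sales)                         # 3 rows loaded
```

The same data frame interface is the target for the other sources too: database queries (DBI), web APIs (httr/jsonlite) and scraped pages (rvest) all typically end up as a data frame.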
  • 43.
Sensors which generate data
  • 44.
Inputs are digitized and placed onto the network
  • 45.
  • 46.
2. Scrub Data (Data Cleaning, Pre-processing)
Garbage in -> garbage out; no quality data -> no quality results.
Data consolidation, data cleaning, handling missing values and outliers, dimensionality reduction, smoothing, normalization.
Text Cleaning Techniques:
• Make all text lower case
• Removing punctuation
• Removal of stop words
• Removing URLs
• Removing HTML tags
• Removing emojis, emoticons
Image Pre-processing:
• Noise removal
• Image filtering
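The text-cleaning steps listed above chain together as a few gsub() calls in base R; a minimal sketch on one made-up tweet:

```r
tweet <- "Loving #RStats!! Check https://example.com <b>now</b> :)"

clean <- tolower(tweet)                      # 1. make all text lower case
clean <- gsub("https?://\\S+", "", clean)    # 2. remove URLs
clean <- gsub("<[^>]+>", "", clean)          # 3. strip HTML tags
clean <- gsub("[[:punct:]]", "", clean)      # 4. remove punctuation (also drops
                                             #    ASCII emoticons like ":)")
clean <- gsub("\\s+", " ", trimws(clean))    # 5. collapse extra whitespace
clean                                        # "loving rstats check now"
```

Stop-word removal is the one step that needs a word list; the tm package mentioned later provides removeWords() and stopwords() for that.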
  • 47.
Feature / variable creation is a process to generate new variables / features based on existing variable(s). (Rank of matrix – number of featured attributes.)
Handling Missing Data:
• Deletion – deleting rows (listwise deletion), pairwise deletion, deleting columns
• Imputation
 – General problem:
  · Categorical – make NA a level, multiple imputation, logistic regression
  · Continuous – mean, median, mode, multiple imputation, linear regression
 – Time-series problem:
  · Data without trend & without seasonality – mean, median, mode, random sample imputation
  · Data with trend & without seasonality – linear interpolation
  · Data with trend & with seasonality – seasonal adjustment + interpolation
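The mean/median imputation branch above for a continuous variable is a one-liner in base R; a minimal sketch (the variable name is illustrative):

```r
age <- c(23, NA, 31, 27, NA, 45)   # toy column with two missing values

# Mean imputation: replace NAs with the mean of the observed values
age_mean <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)

# Median imputation: more robust when the column has outliers
age_med <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

sum(is.na(age_mean))   # 0 - no missing values remain
```

Both methods shrink the variance of the column, which is why multiple imputation or model-based imputation is preferred for serious analyses.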
  • 48.
3. Explore Data
Inspect the data and its properties. Different data types – numerical, categorical, ordinal and nominal – require different treatments.
Descriptive analysis, understanding of the data. Apply descriptive statistics to understand the data. Significance tests, distributions, inferential statistics.
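A first pass of the descriptive statistics mentioned above, sketched on simulated income data (values are synthetic):

```r
set.seed(7)
income <- round(rnorm(50, mean = 50000, sd = 8000))  # 50 simulated incomes

summary(income)                  # min, quartiles, median, mean, max at a glance
mean(income); sd(income)         # centre and spread
quantile(income, c(0.25, 0.75))  # interquartile range endpoints
table(cut(income, 5))            # frequency table over 5 equal-width bins
```

summary() also works on whole data frames, giving per-column statistics for numeric columns and counts for factors, which makes it the usual first command after loading data.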
  • 49.
Types of Analytics
  • 50.
  • 51.
4. Model Data
  • 52.
Decision Models
 Predictive decision models often incorporate uncertainty to help managers analyze risk.
 They aim to predict what will happen in the future.
 Uncertainty is imperfect knowledge of what will happen in the future.
 Risk is associated with the consequences of what actually happens.
  • 53.
Machine Learning Types
• Supervised Learning (target variable available)
 – Continuous target variable → Regression
 – Categorical target variable → Classification
• Unsupervised Learning (target variable not available) → Clustering, Association
• Semi-supervised Learning (categorical target variable) → Classification, Clustering
Machine Learning Algorithms...
  • 54.
Machine Learning Algorithms...
Predictor variable: continuous
 • Continuous response – Linear regression, Neural network, K-nearest neighbor (KNN)
 • Categorical response – Logistic regression, KNN, Neural network
 • No response – Cluster analysis, Principal Component Analysis
Predictor variable: categorical
 • Continuous response – Linear regression, Neural network
 • Categorical response – Decision/classification trees, Logistic regression, Naïve Bayes
 • No response – Association rules
  • 55.
PREDICTION TECHNIQUES – Regression Models
Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. The outcome variable is called the dependent variable; the additional variables are called independent or input variables.
Linear regression model – (sum of squared errors / least squares). Ex: What is a person's expected income?
Logistic regression model – maximum likelihood. Ex: What is the probability that an applicant will default on a loan?
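Both regression flavours above fit with one call each in base R; a hedged sketch on synthetic data (the variable names and the income/default setup are illustrative, echoing the two examples above):

```r
set.seed(42)
experience <- runif(200, 0, 30)                          # years of experience
income <- 20000 + 1500 * experience + rnorm(200, sd = 4000)

# Linear regression: least-squares fit of income on experience
fit_lm <- lm(income ~ experience)
coef(fit_lm)            # intercept near 20000, slope near 1500

# Logistic regression: maximum-likelihood fit for a binary outcome
default <- rbinom(200, 1, plogis(-2 + 0.1 * experience)) # toy default flag
fit_glm <- glm(default ~ experience, family = binomial)
predict(fit_glm, data.frame(experience = 10),
        type = "response")                               # P(default | 10 yrs)
```

summary(fit_lm) and summary(fit_glm) add standard errors and significance tests for each coefficient.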
  • 56.
PREDICTION TECHNIQUES – Regression Models
ARMA – Autoregressive Moving Average models
ARIMA – Autoregressive Integrated Moving Average models
Holt-Winters
Fuzzy logic, Neural networks
Genetic algorithms
LASSO – Least Absolute Shrinkage and Selection Operator
  • 57.
DEEP LEARNING
Deep learning is an artificial intelligence function that imitates the workings of the human brain in processing data and creating patterns for use in decision making. Deep learning is a subset of machine learning in artificial intelligence (AI) with networks capable of learning unsupervised from data that is unstructured or unlabeled. Also known as deep neural learning or deep neural networks.
1. ANN 2. CNN 3. RNN
  • 58.
5. Interpret Data
Interpreting data refers to presenting your results to a non-technical layman. Actionable insight is a key outcome of a data science project. You need strong business domain knowledge to present your findings in a way that answers the business questions you set out to answer, and to translate them into actionable steps.
Data visualization – R: ggplot2; Python: matplotlib, ggplot, seaborn
  • 59.
INSIGHTS OF DATA SCIENCE
Insight is the value obtained through the use of analytics. The insights gained through analytics are incredibly powerful and can be used to grow your business while identifying areas of opportunity. Insight is what is learned and what will improve your business.
The true power of insights-driven marketing: the real value of data and analytics lies in their ability to deliver rich insights. The best insights are actionable and prescriptive – they can be used to take immediate action that will improve your business and will inform your future path.
  • 60.
INSIGHTS OF DATA SCIENCE
2020–2050 – a data-centric world. Statisticians and data scientists will rule the world. All real-world projects are related to data, and several data science projects are coming. Industry is recruiting data-science personnel as Statistician, Business Analyst, Data Analyst, Data Engineer, Data Architect, Information Architect, Data Scientist, etc.
Data Science has started as a new branch: B.Tech (Data Science), MS/M.Tech (Data Science), PhD (Data Science). Data Science has evolved as a separate field in science; industry and academia have accepted it and started it as a new branch.
  • 61.
Sample Works
1. Forecasting on $ rate
2. Forecasting on ionospheric data (R commands for plotting, coloring, graph labeling)
3. Weather forecasting
  • 62.
Social Media Analytics
1. Twitter Analytics 2. Facebook Analytics 3. Google Analytics 4. Web Analytics Tools
Tracking and reporting social media analytics used to be a hurdle for digital marketers – now the problem is finding the ideal tool.
  • 63.
Twitter Analytics using R
Public hashtags: #namo, #kejriwal, #ipl, #climate
Packages used: twitteR – provides access to Twitter data; tm – provides functions for text processing; wordcloud – visualizes the results as a word cloud.
1. Extract tweets from Twitter. 2. Text processing 3. Apply analytical methods 4. Visualization
  • 64.
https://developer.twitter.com/ – obtain the Consumer Key and Consumer Secret, plus the Access Token key and secret, for authentication. Text Processing:
  • 65.
R packages used in Social Media Lab (planned)
• twitteR (for collecting Twitter data)
• tm (text mining)
• wordcloud (text word clouds)
• RTextTools (machine learning package for automatic text classification)
• igraph (network analysis and visualization)
• RCurl (collecting WWW data)
• XML (reading and creating XML documents)
• R.utils (programming utilities)
• ape and dendextend (dendrograms, hierarchical clustering)
• FactoMineR and homals (multiple correspondence analysis)
• plyr and stringr (text sentiment analysis)
  • 66.
  • 67.
Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to classify text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, modeled as Dirichlet distributions.
  • 68.
  • 72.
Time Series Analysis
Time series analysis plays a major role in business analytics. Time series data can be defined as the values taken by a variable traced over periods such as months, quarters or years. For example, in the share market, the price of a share changes every second. Another example of time series data is measuring the level of unemployment each month of the year.
  • 73.
Time Series Analysis
Univariate and multivariate are two types of time series data. When time series data uses a single quantity for describing values, it is termed univariate; when it uses more than a single quantity, it is called multivariate.
R provides many commands for data visualization, such as plot(), hist(), pie(), boxplot(), stripchart(), curve(), abline(), qqnorm(), etc. The plot() and hist() commands are the most used in time series analysis.
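Creating and plotting a univariate time series in base R; a minimal sketch with simulated monthly data (the unemployment framing is illustrative, echoing the example above):

```r
set.seed(3)
# Simulated monthly series: trend + yearly seasonality + noise
raw <- 5 + 0.02 * (1:48) + sin(2 * pi * (1:48) / 12) + rnorm(48, sd = 0.2)

unemp <- ts(raw, start = c(2020, 1), frequency = 12)  # monthly ts object

is.ts(unemp)            # TRUE: the object carries its own time axis
frequency(unemp)        # 12 observations per year
window(unemp, start = c(2021, 1), end = c(2021, 6))   # subset by calendar time
plot(unemp, main = "Simulated monthly series")        # time on the x-axis
```

Because the ts object stores start date and frequency, plot() labels the x-axis in years automatically, and seasonal functions like decompose() know the period.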
  • 74.
Time Series Analysis
Linear Filtering of Time Series: A simple component analysis divides the data into four main components called trend, seasonal, cyclical and irregular. Each component has its own special feature: the trend, seasonal, cyclical and irregular components describe the long-term progression, the seasonal variation, the repeated but non-periodic fluctuations, and the random fluctuations of a time series, respectively.
  • 75.
Time Series Analysis
Decomposing Time Series Data: Decomposing time series data is also a part of the simple component analysis that defines four components, viz., trend, seasonal, cyclical and irregular.
Forecasts Using Exponential Smoothing: Forecasts are a type of prediction that predicts future events from past data. Here, the forecast process uses exponential smoothing for making predictions. An exponential smoothing method tracks the changes in time series data by smoothing out the irrelevant fluctuations and makes short-term forecasts for the series.
  • 76.
Time Series Analysis
1. What do you mean by exponential smoothing?
Ans: An exponential smoothing method tracks the changes in time series data by smoothing out the irrelevant fluctuations and makes short-term forecasts for the series.
2. What is the HoltWinters() function?
Ans: The HoltWinters() function is an inbuilt function commonly used for exponential smoothing. All three types of exponential smoothing use the HoltWinters() function, but with different parameters. The function estimates values between 0 and 1 for the three smoothing parameters, viz., alpha, beta and gamma.
3. What is the function of Holt's exponential smoothing?
Ans: Holt's exponential smoothing estimates the level and slope at the current time point. The alpha and beta parameters of the HoltWinters() function control the estimates of the level and slope, respectively.
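The three exponential-smoothing variants described above all come from HoltWinters() with different parameters switched on; a minimal sketch using two built-in datasets (nhtemp is yearly and non-seasonal, AirPassengers is monthly and seasonal):

```r
# Simple exponential smoothing: level only (beta and gamma disabled)
ses <- HoltWinters(nhtemp, beta = FALSE, gamma = FALSE)

# Holt's exponential smoothing: level + slope (gamma disabled)
holt <- HoltWinters(nhtemp, gamma = FALSE)

# Full Holt-Winters: level + slope + seasonal, needs a seasonal series
hw <- HoltWinters(AirPassengers)

ses$alpha                    # estimated smoothing weight, between 0 and 1
predict(hw, n.ahead = 12)    # 12-month-ahead forecast
```

The closer alpha is to 1, the more weight recent observations get; a small alpha smooths heavily over the past.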
  • 77.
Time Series Analysis – Summary
Time series data is data stored at regular intervals. In statistics, a time series is a sequence of numerical data points in successive order.
• A multivariate time series is a type of time series that uses more than a single quantity for describing the values.
• plot(), hist(), boxplot(), pie(), abline(), qqnorm(), stripchart() and curve() are some R commands used for visualisation of time series data.
• mean(), sd(), log(), diff(), pnorm() and qnorm() are some R commands used for manipulation of time series data.
  • 78.
Time Series Analysis – Summary
• The simple component analysis divides the data into four main components named trend, seasonal, cyclical and irregular. The linear filter is a part of simple component analysis.
• Linear filtering of time series uses linear filters to generate the different components of the time series. Time series analysis mostly uses a moving average as the linear filter.
• The filter() function performs linear filtering of time series data and returns the filtered series.
• The scan() function reads data from a file. Since time series data consists of observations at successive time intervals, scan() is a convenient function for reading it.
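The moving-average filter mentioned above is stats::filter() with equal weights; a minimal sketch:

```r
x <- ts(c(2, 4, 6, 8, 10, 12, 14))

# 3-point moving average: equal weights of 1/3, centred on each observation
ma3 <- stats::filter(x, rep(1/3, 3), sides = 2)
ma3   # NA at both ends, where the 3-point window is incomplete
```

For this linear series the filter reproduces the interior points exactly (e.g. the second value is (2+4+6)/3 = 4); on noisy data the same call smooths out the irregular component, which is how it extracts the trend.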
  • 79.
Time Series Analysis – Summary
• The ts() function stores time series data and creates a time series object.
• The as.ts() function converts a simple object into a time series object, and the is.ts() function checks whether an object is a time series object or not.
• Plotting represents time series data graphically during time series analysis, and R provides the plot() function for plotting time series data.
• A non-seasonal time series contains only the trend and irregular components; hence, the decomposition process splits non-seasonal data into these components.
  • 80.
Time Series Analysis – Summary
• The SMA() function is used for decomposition of non-seasonal time series. It smooths time series data by calculating the moving average and estimates the trend and irregular components. The function is available in the "TTR" package.
• A seasonal time series contains the seasonal, trend and irregular components; hence, the decomposition process splits seasonal data into these three components.
• The seas() function automatically finds the seasonally adjusted series. The function is available in the "seasonal" package.
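The seasonal split described above can be demonstrated without installing TTR or seasonal: base R's decompose() performs the same seasonal/trend/irregular separation for a seasonal series. A minimal sketch on the built-in AirPassengers data:

```r
dec <- decompose(AirPassengers)   # additive: x = seasonal + trend + random

names(dec)   # includes "seasonal", "trend", "random"
plot(dec)    # observed series and its three components, stacked

# The components reconstruct the series exactly wherever the trend
# (a centred moving average) is defined, i.e. away from the series ends:
recon <- dec$seasonal + dec$trend + dec$random
all.equal(window(recon, 1950, c(1958, 12)),
          window(AirPassengers, 1950, c(1958, 12)))   # TRUE
```

For a series with growing seasonal swings like AirPassengers, decompose(AirPassengers, type = "multiplicative") is usually the better model.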
  • 81.
Time Series Analysis – Summary
• Regression analysis defines the linear relationship between independent variables (predictors) and the dependent (response) variable using a linear function.
• Forecasts are a type of prediction that predicts future events from past data.
• Simple exponential smoothing estimates the level at the current time point and performs short-term forecasts. The alpha parameter of the HoltWinters() function controls simple exponential smoothing.
• Holt's exponential smoothing estimates the level and slope at the current time point. The alpha and beta parameters of the HoltWinters() function control the estimates of the level and slope, respectively.
  • 82.
Time Series Analysis – Summary
• Holt-Winters exponential smoothing estimates the level, slope and seasonal component at the current time point. The alpha, beta and gamma parameters of the HoltWinters() function control the estimates of the level, slope of the trend component and seasonal component, respectively.
• ARIMA (Autoregressive Integrated Moving Average) is another method of time series forecasting.
• The ARIMA model explicitly models the irregular component of a stationary time series with non-zero autocorrelation. It is represented as ARIMA(p, d, q), where the parameters p, d and q define the autoregression (AR) order, the degree of differencing and the moving average (MA) order, respectively.
  • 83.
Time Series Analysis – Summary
• The diff() function differences a time series, which helps to obtain a stationary series for an ARIMA model and to find the value of the 'd' parameter of an ARIMA(p, d, q) model.
• The auto.arima() function automatically returns the best candidate ARIMA(p, d, q) model for a given time series.
• The arima() function estimates the parameters of the selected ARIMA(p, d, q) model.
• The arima.sim() function simulates time series values from a specified ARIMA model.
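The diff()/arima() workflow summarised above, sketched in base R on the built-in AirPassengers data (auto.arima() lives in the separate forecast package, so this sketch fixes the classic "airline model" orders by hand):

```r
y <- log(AirPassengers)   # log stabilises the growing seasonal variance
dy <- diff(y)             # first difference removes the trend => d = 1

acf(dy)                   # autocorrelation plot guides the choice of p and q

# The classic airline model: ARIMA(0,1,1) with a seasonal (0,1,1) at lag 12
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit$coef                              # estimated MA and seasonal MA terms
exp(predict(fit, n.ahead = 12)$pred)  # 12-month forecast, back on the
                                      # original passenger scale
```

auto.arima(y) from the forecast package would search over (p, d, q) orders automatically and typically lands on a model close to this one for this series.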
  • 84.
 IBM Watson Analytics tool  developer.twitter.com  developer.google.com  Developer.ibm  APIs  Amazon Web Services
  • 85.
  • 86.
SELF ANALYTICS
• Stay Home, Stay Safe
• Do yoga, stay healthy (physical fitness) (hand movements, neck exercises, eye exercises, Surya Namaskar...)
• Do breathing exercises – they improve your concentration power
• Maintain social distance from electronic devices for at least 2 hrs per day (1 hour before going to bed and 1 hour after waking up). After using a mobile or computer, practice the Gandhari exercise.
• Practice meditation – move towards spirituality.
  • 87.
The purpose of LIFE is "Awakening within You and Awakening the Others". "Life Long Loyal Love for Learning"