By
Dr. A.V.Krishna Prasad
Associate Professor, IT Dept, MVSR EC.
kpvambati@gmail.com
Data Analytics
&
Time Series Analysis
Importance of Topic
 Can you imagine a day without electricity?
 Can you imagine a day without Computer / smart
phone / mobile?
 Can you imagine a day without Internet?
 Can you imagine a day without applying the analysis /
analytics concept in any work?
 Data Science & Analytics
Data Analytics
Mathematical Statistics
Computer Science Applications
One Programming Language
Database Technologies
AI
ML
Data Mining
SPM
Linear Algebra
Probability
Statistics
Real life applications
1. Insurance
2. Banking
3. Telecom Churn
4. Social Media
5. Stock Market
6. Financial account
7. Recommendation
Systems
Analysis & Analytics
Analytics is a process of inspecting, cleaning,
transforming, and modelling big data with the
goal of discovering useful information,
suggesting conclusions, and supporting
decision making.
Data analysis is the process by which data
becomes understanding, knowledge and
insight.
Analysis – a knowledge expert / skilled person with domain knowledge
is required for decision making.
Analytics – for the naive user – automating the decision-making process.
Connection to data mining
–Analytics includes both data analysis (mining) and communication
(guiding decision making)
–Analytics is not so much concerned with individual analyses or
analysis steps, but with the entire methodology
•Analytics should act like an Extra Brain / Extra Eye / Extra ear /
Extra Sensor like Sixth sense to an Organization.
•It’s a Visualization Tool (Dashboard)
Differences between Analytics and Analysis
Data Storage Terminology
Data
Data - Numeric, Character, Integer, Real, Rational, Discrete,
Continuous, Binary, Interval Variable, Scaled, Ordinal,
Categorical…
Data – Univariate, Bivariate, Multivariate, …
Recent Data Trends - RFID Data, Web Term Data, Sensor Array
Data, Gene Expression Data, Consumer Preference Data, Symbols,
Social Media data etc.
Emoticons - Smiley, Angry
Data - are encodings that represent the qualitative or quantitative
attributes of a variable or set of variables.
Data comprises facts and statistics collected together for
reference or analysis.
Viewing the Data
Data – Object type
Array
List
Table
Matrix
Vector
Data Frame
One Dimensional
Two Dimensional
Multi-Dimensional
Multi-Dimensional Data as Three-Field Table
versus Two-Dimensional Matrix
Multi-Dimensional Data as Four-Field Table
versus Three-Dimensional Cube
BIG DATA
Big Data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and
analyze.
Big Data refers to data sets that grow so large and complex that it
is difficult to capture, store, manage, share, analyze and
visualize them with current computational architectures.
Goals: To discover new opportunities, measure
efficiencies, and uncover relationships.
Big Data
•1. The Data increases continuously
•Ex: Angry Birds – Mobile App
•(After downloading by millions of people – back-end
database – users, levels, scores, functionality, speed, etc.)
•2. Structured / Unstructured
•Example for a DBMS - Excel sheet
• RDBMS – DB2, Informix, etc.
•Data types for Facebook – emails, smileys, XML data, audio,
video
Big Data
•3. Difficult to Analyze
(Blu-ray disc – 2 GB movie – 2 hrs to watch and analyze;
1 PB -- 10 years.
Facebook generates 300 PB every month; YouTube, CCTV videos, etc.)
4. Within a certain tolerable time limit
(Facebook, YouTube, etc. usage on the Windows, macOS and UNIX platforms…
Can you visualize, within one hour, how many users have been working on the
Windows, macOS and UNIX platforms over the last 2 months?)
Types of Analytics
Data Science refers to gaining insights into
data through computation, statistics, and
visualization.
A Data Scientist is... “someone who knows
more statistics than a computer scientist and
more computer science than a statistician.”
- Josh Blumenstock
“Data Scientist = Statistician + Programmer +
Coach + Storyteller + Artist”.
- Shlomo Argamon
● Quantitative skill:
such as mathematics or statistics
● Technical aptitude:
namely, software engineering, machine learning, and programming skills
● Skeptical mind-set and critical thinking:
It is important that data scientists can examine their work
critically rather than in a one-sided way.
● Curious and creative:
Data scientists are passionate about data and finding creative ways to
solve problems and portray information.
● Communicative and collaborative:
Data scientists must be able to articulate the business value in a clear
way and collaboratively work with other groups, including project
sponsors and key stakeholders.
Data Scientist - Characteristics
Example of Big Data Analytics
After analyzing consumer purchasing behavior,
Target’s statisticians determined that the retailer
made a great deal of money from three main life-
event situations.
● Marriage, when people tend to buy many new
products
● Divorce, when people buy new products and
change their spending habits
● Pregnancy, when people have many new things to
buy and have an urgency to buy them
Why We are giving importance to
Business Analytics
• Product View: (19th Century)
Suppliers & Customers
• Managerial View: (20th Century)
Suppliers, Customers, Owners & Employees
• Business Intelligence View: (1960s to 1990s)
Suppliers, Customers, Owners, Employees,
Competitors, Government & Environmental view
• Next Generation Business Intelligence View:
(current – ANALYTICS View)
Suppliers, Customers, Owners, Employees,
Competitors, Government, Environment, Online
communities, news, media, International Partners, &
Multinational Companies.
Started with Exchange
of Goods next Goods
selling
Evolution of Business Analytics
Some common types of decisions that can be enhanced by using analytics
include
• Pricing (for example, setting prices for consumer and industrial goods,
government contracts, and maintenance contracts).
• Customer Segmentation (for example, identifying and targeting key
customer groups in retail, insurance, and credit card industries).
ADVANTAGES:
 BA increases profitability and shareholder returns
 BA enhances understanding of data
 BA is vital for businesses to remain competitive
 BA enables creation of informative reports
Four Types of Data Based on Measurement Scale:
 Categorical (nominal) data
 Ordinal data
 Interval data
 Ratio data
Data Availability
Example 1.3
Classifying Data Elements in a Purchasing Database
Data for Business Analytics
Figure 1.2
Categorical (nominal) Data
 Data placed in categories according to a specified
characteristic
 Categories bear no quantitative relationship to one another
 Examples:
- customer’s location (America, Europe, Asia)
- employee classification (manager, supervisor,
associate)
Data for Business Analytics
Ordinal Data
 Data that is ranked or ordered according to some
relationship with one another
 No fixed units of measurement
 Examples:
- college football rankings
- survey responses
(poor, average, good, very good, excellent)
Data for Business Analytics
Interval Data
 Ordinal data but with constant differences between
observations
 No true zero point
 Ratios are not meaningful
 Examples:
- temperature readings
- SAT scores
Data for Business Analytics
Ratio Data
 Continuous values and have a natural zero point
 Ratios are meaningful
 Examples:
- monthly sales
- delivery times
Data for Business Analytics
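The interval-versus-ratio distinction above can be made concrete with a small Python sketch (the temperature figures are purely illustrative): on an interval scale such as Celsius, ratios are not meaningful because there is no true zero, while on a ratio scale such as Kelvin they are.

```python
# Interval vs. ratio scales: a hypothetical temperature example.
# Celsius is an interval scale (no true zero), so ratios are meaningless;
# Kelvin is a ratio scale with a natural zero, so ratios are meaningful.

def c_to_k(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

a_c, b_c = 10.0, 20.0                    # two readings in Celsius
naive_ratio = b_c / a_c                  # 2.0 -- but 20 C is NOT "twice as hot" as 10 C
true_ratio = c_to_k(b_c) / c_to_k(a_c)   # ~1.035 on the ratio (Kelvin) scale

print(naive_ratio)           # 2.0
print(round(true_ratio, 3))  # 1.035
```

The same reasoning explains why monthly sales (ratio data) support statements like "twice as much", but SAT scores (interval data) do not.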
Variable Types – Identify the correct type(s):

Variable         Scoring                                                   Nominal   Ordinal   Continuous
Quality of life  1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good
Ethnicity        1 = Non-Hispanic, 2 = Hispanic
Race             1 = African American, 2 = Caucasian, 3 = Other
Diabetes         1 = Absent, 2 = Present
Systolic BP      Ranges from 95 to 190 mmHg
Variable Types – Identify the correct type(s):

Variable         Scoring                                                   Nominal   Ordinal   Continuous
Quality of life  1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good            ●
Ethnicity        1 = Tribal, 2 = Religious                                 ●
Race             1 = Red Indian, 2 = Black Indian, 3 = Other               ●
Diabetes         1 = Absent, 2 = Present                                   ●
Systolic BP      Ranges from 95 to 190 mmHg                                                    ●
Web References
 https://cran.r-project.org/ (For Software –R)
 https://www.rstudio.com/ (For R studio – GUI)
 http://www.r-tutor.com/ (For Basic Learners)
 https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial
(For Scripts)
 https://www.analyticsvidhya.com/
(For Advanced level – Mining, Machine Learning..)
General Programming Components
Program Structure, Syntax, Semantics, Execution part
Comments, Data Storage Elements, Data Types,
I/O Statements, Control Statements, Arrays,
Structures, Functions, Files etc.
R Packages
Specialized functions, its Usage
and its Importance
Implementation in R
R Programming PPT
Descriptive statistics
Predictive statistics
Visualization
DATA VISUALIZATION IN R
Basic Visualization
•Histogram
•Bar / Line Chart
•Box plot
•Scatter plot
Advanced Visualization
Heat Map
Mosaic Map
Map Visualization
3D Graphs
Correlogram
demo(graphics) – for a graphics demo in the R tool
Libraries – ggplot2, RColorBrewer; Package: HistData
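The four basic chart types listed above can be sketched in Python with matplotlib (the deck demonstrates the same charts in R via plot(), hist(), boxplot(), etc.). The data set is made up, and the sketch assumes matplotlib is installed.

```python
# Basic visualization sketch: histogram, bar chart, box plot, scatter plot.
import os
import tempfile

import matplotlib
matplotlib.use("Agg")                    # headless backend: render to a file
import matplotlib.pyplot as plt

data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 9]   # illustrative sample
x = list(range(len(data)))

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(data, bins=5)            # histogram
axes[0, 1].bar(x, data)                  # bar chart
axes[1, 0].boxplot(data)                 # box plot
axes[1, 1].scatter(x, data)              # scatter plot

out_path = os.path.join(tempfile.gettempdir(), "basic_plots.png")
fig.savefig(out_path)                    # write the 2x2 panel to disk
print(out_path)
```

The heat maps, mosaic plots and 3D graphs listed under advanced visualization follow the same pattern with other plotting calls or packages.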
Five Steps of a Machine Learning Project Lifecycle
 1. Obtain Data
 2. Scrub Data
 3. Explore Data
 4. Model Data
 5. Interpret Data
Project Development Life Cycle
Five Steps of a Machine Learning Project Lifecycle
 1. Obtain Data – Data Collection, Sources
 2. Scrub Data - Data Cleaning
 3. Explore Data – Descriptive Statistics
 4. Model Data - Model Fitting
 5. Interpret Data - Results
Project Development Life Cycle
Project Life Cycle
Project Development Life Cycle
1. Obtain Data
Data Collection:
Querying from databases, flat files, Excel data
Unstructured – NoSQL, MongoDB
Social media data – Facebook, WhatsApp, Twitter
Website data – web scraping, Beautiful Soup,
Web APIs, IoT sensor data
Data sets from websites like Kaggle, KDnuggets, ISRO, NRSC,
Bhuvan…
Formats – CSV, TSV, special sparse formats
Project Development Life Cycle
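As a minimal sketch of the "Obtain Data" step, here is a flat-file read with Python's standard csv module (the sales figures are invented; a real project would query a database, a web API or a downloaded data set instead):

```python
# Step 1 in miniature: parse a (hypothetical) CSV flat file.
import csv
import io

raw = """date,sales
2024-01,120
2024-02,135
2024-03,150
"""

rows = list(csv.DictReader(io.StringIO(raw)))  # header row -> dict per record
sales = [int(r["sales"]) for r in rows]        # convert text fields to numbers
print(len(rows), sales)                        # 3 [120, 135, 150]
```

In practice `io.StringIO(raw)` would be replaced by `open("sales.csv")`, and TSV files only change the `delimiter` argument.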
Sensors which generate data
Project Development Life Cycle
Inputs are digitized and placed onto the network
Project Development Life Cycle
Project Development Life Cycle
2. Scrub Data (Data Cleaning, Pre-processing)
Text Cleaning Techniques
•Make all text lower case
•Remove punctuation
•Remove stop words
•Remove URLs
•Remove HTML tags
•Remove emojis and emoticons
Garbage in –> Garbage out;
No quality data -> No quality results
Data Consolidation, Data Cleaning, Handling Missing Values, Outliers
Dimensionality Reduction, Smoothing, Normalization
Image Pre-processing
•Noise Removal
•Image Filtering
Project Development Life Cycle
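The text-cleaning steps listed above can be sketched as one function using only the Python standard library. The stop-word list, the `clean_text` helper and the sample sentence are all made up for illustration; real pipelines usually take stop words from a package such as NLTK.

```python
# Scrub Data sketch: lower-case, strip URLs, HTML tags, emojis,
# punctuation and stop words, in that order.
import re
import string

STOP_WORDS = {"a", "an", "the", "is", "at", "on", "and"}   # tiny illustrative list

def clean_text(text):
    text = text.lower()                               # 1. lower case
    text = re.sub(r"https?://\S+", "", text)          # 2. remove URLs
    text = re.sub(r"<[^>]+>", "", text)               # 3. remove HTML tags
    text = re.sub(r"[^\x00-\x7F]+", "", text)         # 4. strip emojis (non-ASCII)
    text = text.translate(str.maketrans("", "", string.punctuation))  # 5. punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]          # 6. stop words
    return " ".join(words)

print(clean_text("Check <b>THIS</b> out at https://example.com! 🙂 It is great."))
# -> "check this out it great"
```

Garbage in, garbage out: every downstream step (word clouds, sentiment, topic models) inherits whatever this step leaves behind.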
Feature / variable creation is the process of generating new
variables / features from existing variable(s).
Handling Missing Data
• Deletion
  – Deleting rows (listwise deletion)
  – Pairwise deletion
  – Deleting columns
• Imputation
  – General problem
    Categorical: make NA a level, multiple imputation, logistic regression
    Continuous: mean, median, mode, multiple imputation, linear regression
  – Time-series problem
    Data without trend & without seasonality: mean, median, mode, random sample imputation
    Data with trend & without seasonality: linear interpolation
    Data with trend & with seasonality: seasonal adjustment + interpolation
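Two of the imputation strategies above can be sketched in plain Python: mean imputation for a general continuous variable, and linear interpolation for time-series data with trend but no seasonality. `None` marks a missing value; both helper names and the data are invented for illustration.

```python
# Missing-value handling sketch.

def impute_mean(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def interpolate_linear(values):
    """Fill interior gaps by a straight line between observed neighbours."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            lo, hi = i - 1, i + 1
            while out[hi] is None:            # find the next observed value
                hi += 1
            step = (out[hi] - out[lo]) / (hi - lo)
            out[i] = out[lo] + step * (i - lo)
    return out

print(impute_mean([1, None, 3, None, 5]))        # [1, 3.0, 3, 3.0, 5]
print(interpolate_linear([10, None, None, 16]))  # [10, 12.0, 14.0, 16]
```

Mean imputation ignores any trend, which is exactly why the tree above reserves it for data without trend and seasonality.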
Feature/ Variable creation
Rank of matrix –
number of feature attributes
Project Development Life Cycle
3. Explore Data
Inspect the data and its properties.
Different data types like numerical data, categorical data,
ordinal and nominal data etc. require different treatments.
Descriptive Analysis, Understanding of Data
Apply Descriptive Statistics to understand the data.
Significance Test,
Distributions, Inferential Statistics
Project Development Life Cycle
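The descriptive-statistics step can be sketched with Python's standard `statistics` module (the R equivalents are mean(), median(), sd(), etc.); the sample values here are made up.

```python
# Explore Data sketch: central tendency and spread of a small sample.
import statistics as st

sample = [12, 15, 11, 18, 22, 15, 19, 15]

print(st.mean(sample))                 # 15.875 -- central tendency
print(st.median(sample))               # 15.0
print(st.mode(sample))                 # 15
print(round(st.stdev(sample), 2))      # sample standard deviation (spread)
```

These summaries, together with the distribution checks and significance tests mentioned above, decide which treatment each variable needs before modelling.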
Types of Analytics
Project Development Life Cycle
Project Development Life Cycle
4. Model Data
Project Development Life Cycle
 Predictive Decision Models often incorporate
uncertainty to help managers analyze risk.
 Aim to predict what will happen in the future.
 Uncertainty is imperfect knowledge of what will
happen in the future.
 Risk is associated with the consequences of
what actually happens.
Decision Models
Project Development Life Cycle
Machine Learning Types
Supervised Learning
• Continuous target variable → Regression
• Categorical target variable → Classification
Unsupervised Learning
• Target variable not available → Clustering, Association
Semi-supervised Learning
• Categorical target variable → Classification, Clustering
Machine Learning Algorithms…
Project Development Life Cycle
Choosing an algorithm by variable type:
• Predictor continuous, response continuous: Linear regression, Neural
network, K-nearest Neighbor (KNN)
• Predictor continuous, response categorical: Logistic regression, KNN,
Neural network
• Predictor continuous, no response variable: Cluster analysis, Principal
Component Analysis
• Predictor categorical, response continuous: Linear regression, Neural
network
• Predictor categorical, response categorical: Decision/classification
trees, Logistic regression, Naïve Bayes
• Predictor categorical, no response variable: Association rules
Project Development Life Cycle
PREDICTION TECHNIQUES
Regression models
Regression Analysis attempts to explain the influence that a set of variables
has on the outcome of another variable of interest. The outcome variable is
called the dependent variable. The additional variables are called
independent or input variables.
Linear Regression Model –
(least squares: minimize the sum of squared errors)
Ex: What is the person's expected income?
Logistic Regression Model – Maximum Likelihood
Ex: What is the probability that an applicant will default on a loan?
Project Development Life Cycle
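The least-squares idea behind linear regression can be sketched from first principles in plain Python: pick the intercept and slope that minimize the sum of squared errors. The `fit_line` helper and the income figures are invented for illustration, not a library call.

```python
# Ordinary least squares for one predictor, via the closed-form solution.

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing sum((y - (a + b*x))**2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

years = [1, 2, 3, 4, 5]                  # years of experience
income = [30, 35, 40, 45, 50]            # expected income (thousands, made up)
a, b = fit_line(years, income)
print(a, b)                              # 25.0 5.0 -- income = 25 + 5 * years
```

Logistic regression replaces this closed form with maximum-likelihood estimation, which has no closed-form solution and is fitted iteratively.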
PREDICTION TECHNIQUES
Regression models
ARMA – Autoregressive Moving Average models
ARIMA – Autoregressive Integrated Moving Average models
Holt-Winters
Fuzzy Logic
Neural Networks
Genetic Algorithms
LASSO – Least Absolute Shrinkage and Selection Operator
Project Development Life Cycle
DEEP LEARNING
Deep learning is an artificial intelligence function that imitates the
workings of the human brain in processing data and creating
patterns for use in decision making. Deep learning is a subset
of machine learning in artificial intelligence (AI) that has
networks capable of learning unsupervised from data that is
unstructured or unlabeled. Also known as deep neural learning or
deep neural network.
1. ANN
2. CNN
3. RNN
5. Interpreting Data
Interpreting data refers to presenting your findings to a non-technical
audience.
Actionable insight is a key outcome from a data science project.
Need strong business domain knowledge to present your findings in a way
that can answer the business questions you set out to answer, and translate
them into actionable steps.
Data Visualization – R – ggplot2
Python – matplotlib, ggplot, seaborn
Project Development Life Cycle
INSIGHTS OF DATA SCIENCE
Insight is the value obtained through the use of analytics. The
insights gained through analytics are incredibly powerful and can be
used to grow your business while identifying areas of opportunity.
Insight is what is learned and what will improve your business.
The True Power of Insights-Driven Marketing:
The real value of data and analytics lies in their ability to deliver rich
insights.
The best insights are actionable and prescriptive - they can be used to
take immediate action that will improve your business and will
inform your future path.
Project Development Life Cycle
INSIGHTS OF DATA SCIENCE
2020 – 2050 – a Data-Centric World.
Statisticians and Data Scientists will rule the world.
All real-world projects are related to data. Several data science
projects are coming.
Industry is recruiting data-science-related personnel as Statistician,
Business Analyst, Data Analyst, Data Engineer, Data Architect,
Information Architect, Data Scientist, etc.
Data Science has started as a new branch in B.Tech (Data Science),
MS/M.Tech (Data Science) and PhD (Data Science) programmes.
Data Science has evolved as a separate field in science; industry and
academia have accepted it and started it as a new branch.
Project Development Life Cycle
Sample Works
1. Forecasting on $ Rate
2. Forecasting on Ionospheric data
(R Commands for plotting, Coloring, Graph
Labeling)
3. Weather Forecasting
Social Media Analytics
1. Twitter Analytics
2. Facebook Analytics
3. Google Analytics
4. Web Analytics
Tools
Tracking and reporting social media analytics used to be a
hurdle for digital marketers – now the problem is finding
the ideal tool.
Twitter Analytics using R
Public handles
#namo, #kejriwal, #ipl, #climate
Packages used :
twitteR – provides access to Twitter data
tm – provides functions for text processing
wordcloud – visualizes the results as a word cloud
1. Extract Tweets from Twitter.
2. Text Processing
3. Apply Analytical Methods
4. Visualization
https://developer.twitter.com/
Consumer Key and Consumer Secret number , Access Token
Authentication key and number
Text Processing:
R packages used in Social Media Lab (planned)
• twitteR (for collecting Twitter data)
• tm (text mining)
• wordcloud (text word clouds)
• RTextTools (machine learning package for automatic text
classification)
• igraph (network analysis and visualization)
• RCurl (collecting WWW data)
• XML (reading and creating XML documents)
• R.utils (programming utilities)
• ape and dendextend (dendrograms, hierarchical clustering)
• FactoMineR and homals (multiple correspondence analysis)
• plyr and stringr (text sentiment analysis)
Twitter Analytics using R
IMPLEMENTATION
in R
Topic modeling is a type of statistical modeling for
discovering the abstract “topics” that occur in a collection of
documents.
Latent Dirichlet Allocation (LDA) is an example of a topic
model and is used to classify the text in a document to a
particular topic.
It builds a topic per document model and words per topic
model, modeled as Dirichlet distributions.
Facebook:
https://developers.facebook.com/
https://analytics.facebook.com/
https://business.facebook.com/
Twitter
https://developer.twitter.com/
https://analytics.twitter.com/
https://business.twitter.com/
Time Series Analysis
Time series analysis plays a major role in business analytics.
Time series data can be defined as the quantities that trace the values
taken by a variable over a period such as a month, quarter or year.
For example, in the share market, the price of shares changes every
second.
Another example of time series data is measuring the level of
unemployment each month of the year.
Time Series Analysis
Univariate and multivariate are two types of time series data.
When time series data uses a single quantity to describe values, it
is termed univariate.
When time series data uses more than a single quantity to
describe values, it is called multivariate.
R language provides many commands for Data Visualization such
as plot(), hist(), pie(), boxplot(), stripchart(), curve(), abline(),
qqnorm(), etc.
plot() and hist() commands are mostly used in time series analysis.
Time Series Analysis
Linear Filtering of Time Series :
A simple component analysis divides the data into four main
components called trend, seasonal, cyclical and irregular.
Each component has its own special feature.
For example, the trend, seasonal, cyclical and irregular components
define the long-term progression, the seasonal variation, the
repeated but non-periodic fluctuations, and the random or irregular
components of any time series, respectively.
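The moving-average linear filter that time series analysis mostly uses can be sketched in plain Python (the `moving_average` helper mirrors what R's filter()/SMA() provide; window size and data are illustrative): smoothing the series damps the irregular component and exposes the trend.

```python
# Linear filtering sketch: simple moving average over a sliding window.

def moving_average(series, window):
    """Average each run of `window` consecutive points."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

series = [3, 5, 4, 6, 5, 7, 6, 8]        # upward trend + irregular wiggle
print(moving_average(series, 3))
# -> [4.0, 5.0, 5.0, 6.0, 6.0, 7.0]  (the underlying upward trend)
```

A larger window smooths more aggressively; for seasonal data the window is usually set to the length of the seasonal cycle.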
Time Series Analysis
Decomposing time series Data :
Decomposing time series data is also a part of the simple component
analysis that defines four components, viz., trend, seasonal,
cyclical and irregular.
Forecasts Using exponential smoothing :
Forecasts are a type of prediction that predict future events from past
data. Here, the forecast process uses exponential smoothing for
making predictions.
An exponential smoothing method finds out the changes in time series
data by ignoring the irrelevant fluctuations and makes the short-term
forecast prediction for time series data.
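Simple exponential smoothing as described above can be sketched in a few lines of Python: each smoothed value is a weighted blend of the newest observation and the previous smoothed value, which damps the irrelevant fluctuations. The `ses` helper and the data are illustrative, and alpha is fixed by hand here, whereas R's HoltWinters() would estimate it from the data.

```python
# Simple exponential smoothing sketch.

def ses(series, alpha):
    """Return the smoothed (level) series; the last value is the forecast."""
    level = series[0]                     # initialise with the first observation
    out = [level]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level   # blend new data with old level
        out.append(level)
    return out

smoothed = ses([10, 12, 11, 13, 12], alpha=0.5)
print(smoothed)   # [10, 11.0, 11.0, 12.0, 12.0]
```

With alpha near 1 the smoother tracks every fluctuation; with alpha near 0 it barely moves, so the choice of alpha is the whole trade-off.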
Time Series Analysis
1. What do you mean by exponential smoothing?
Ans: Exponential smoothing method finds out the changes in time series
data by ignoring the irrelevant fluctuations and makes the short-term
forecast prediction for time series data.
2. What is the HoltWinters() function?
Ans: The HoltWinters() function is an inbuilt function, commonly used
for finding exponential smoothing. All three types of exponential
smoothing use the HoltWinters() function but with different parameters.
The HoltWinters() function returns a value between 0 and 1 for all three
parameters, viz., alpha, beta and gamma.
3. What is the function of Holt’s exponential smoothing?
Ans: Holt’s exponential smoothing estimates the level and slope at the
current time point.
The alpha and beta parameters of the HoltWinters() function control it
and estimates the level and slope, respectively.
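Holt's method from Q3 can be sketched in plain Python: it tracks both a level and a slope, matching the alpha/beta description above. The `holt_forecast` helper, the smoothing constants and the data are illustrative; HoltWinters() would estimate alpha and beta rather than take them as fixed inputs.

```python
# Holt's (double) exponential smoothing sketch: level + slope.

def holt_forecast(series, alpha, beta, h):
    """Return the h-step-ahead forecast from Holt's linear method."""
    level = series[0]
    slope = series[1] - series[0]         # initial slope from the first two points
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + slope)   # update level
        slope = beta * (level - prev_level) + (1 - beta) * slope  # update slope
    return level + h * slope              # extrapolate the trend h steps ahead

print(holt_forecast([10, 12, 14, 16, 18], alpha=0.5, beta=0.5, h=1))
# -> 20.0 (the forecast continues the linear trend)
```

Adding a third, gamma-weighted seasonal equation to this pair gives the full Holt-Winters method.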
Time Series Analysis - Summary
Time series data is a type of data that is stored at regular intervals or
follows the concept of time series.
Statistically, a time series is a sequence of
numerical data points in successive order.
• A multivariate time series is a type of time series that uses more
than a single quantity for describing the values.
• The plot(), hist(), boxplot(), pie(), abline(), qqnorm(), stripchart()
and curve() functions are some R commands used for visualisation of
time series data.
• The mean(), sd(), log(), diff(), pnorm() and qnorm() are some R
commands used for manipulation of time series data.
Time Series Analysis - Summary
• The simple component analysis divides the data into four main
components named trend, seasonal, cyclical and irregular. The
linear filter is a part of simple component analysis.
• Linear filtering of time series uses linear filters for generating the
different components of the time series. Time series analysis
mostly uses a moving average as a linear filter.
• A filter() function performs linear filtering of time series data and
generates the time series of the given data.
• A scan() function reads the data from any file. Since time series
data contains data with respect to a successive time interval, it is
the best function for reading it.
Time Series Analysis - Summary
• A ts() function stores time series data and creates a time
series object.
• The as.ts() function converts a simple object into time
series object and the is.ts() function checks whether an
object is a time series object or not.
• The plotting task represents time series data graphically
during time series analysis and R provides the plot()
function for plotting of time series data.
• A non-seasonal time series contains the trend and irregular
components; hence, the decomposition process converts
the non-seasonal data into these components.
Time Series Analysis - Summary
• The SMA() function is used for decomposition of non-
seasonal time series. It smooths time series data by
calculating the moving average and estimates the trend and
irregular components.
• The function is available in the package “TTR”.
• A seasonal time series contains the seasonal, trend and
irregular component; hence, the decomposition process
converts the seasonal data into these three components.
• The seas() function automatically finds out the seasonally
adjusting series. The function is available in the “seasonal”
package.
Time Series Analysis - Summary
• Regression analysis defines the linear relationship between
independent variables (predictors) and dependent (response)
variables using a linear function.
• Forecasts are a type of prediction that predicts the future events
from the past data.
• Simple exponential smoothing estimates the level at the current
time point and performs the short-term forecast. The alpha
parameter of the HoltWinters() function controls the simple
exponential smoothing.
• Holt’s exponential smoothing estimates the level and slope at the
current time point. The alpha and beta parameters of the
HoltWinters() function controls it and estimates the level and
slope, respectively.
Time Series Analysis - Summary
• Holt-Winters exponential smoothing estimates the level, slope
and seasonal component at the current time point. The
alpha, beta and gamma parameters of the HoltWinters() function
control it and estimate the level, slope of the trend component and
seasonal component, respectively.
• ARIMA (Autoregressive Integrated Moving Average) is
another method of time series forecasting.
• The ARIMA model explicitly defines the irregular component of
a stationary time series with non-zero autocorrelation. It
is represented by ARIMA(p, d, q), where the parameters p, d and q
define the autoregression (AR) order, the degree of differencing
and the moving average (MA) order, respectively.
Time Series Analysis - Summary
• The diff() function differences a time series to find an
appropriate ARIMA model. It helps to obtain a stationary time
series for an ARIMA model and also finds the value of the 'd'
parameter of an ARIMA(p, d, q) model.
• An auto.arima() function automatically returns the best candidate
ARIMA(p, d, q) model for the given time series.
• An arima() function finds out the parameters of the selected
ARIMA(p, d, q) model.
• The arima.sim() function simulates a time series from an
ARIMA(p, d, q) model, from which the autocorrelation and
partial autocorrelation can then be examined.
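What R's diff() does can be sketched in plain Python (the trending series and the `diff` helper are illustrative): first differencing removes a linear trend, and differencing until the series looks stationary is how the 'd' in ARIMA(p, d, q) is chosen.

```python
# Differencing sketch: x[t] - x[t - lag], mirroring R's diff().

def diff(series, lag=1):
    """Return the lag-differenced series."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [5, 8, 11, 14, 17, 20]           # linear trend: not stationary
print(diff(trend))                       # [3, 3, 3, 3, 3] -- constant, so d = 1 suffices
```

A series with a quadratic trend would need a second pass of differencing, i.e. d = 2.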
 IBM Watson Analytical Tool
 Developer.twitter.com
 Developer.google.com
 Developer.ibm
 API
 Amazon Web services
Even though ChatGPT…
SELF ANALYTICS
• Stay Home Stay Safe
• Do Yoga, Stay Healthy ( Physical Fitness)
( Hand Movements, Neck Exercises, Eye Exercise, Surya Namaskar..)
• Do Breathing Exercises – Improves your Concentration power
• Maintain Social distance with electronic devices at least 2 hrs per day
( 1 hour before going to bed and 1 hour after coming from bed)
After using mobile, Computer – Practice Gandhari exercise.
• Practice Meditation – Move towards spirituality.
The purpose of LIFE is
“Awakening within You and Awakening
the Others”
“Life Long Loyal Love for Learning”
Kp-Data Analytics-ts.pptx

Kp-Data Analytics-ts.pptx

  • 1.
    By Dr. A.V.Krishna Prasad AssociateProfessor, IT Dept, MVSR EC. kpvambati@gmail.com Data Analytics & Time Series Analysis
  • 2.
    Importance of Topic Can you imagine a day without electricity?  Can you imagine a day without Computer / smart phone / mobile?  Can you imagine a day without Internet?  Can your imagine a day without doing Analysis / Analytics concept in any work?  Data Science & Analytics
  • 3.
    Data Analytics MathematicalStatistics Computer ScienceApplications One Programming Language Data Base Technologies AI ML Data Mining SPM Linear Algebra Probability Statistics Real life applications 1. Insurance 2. Banking 3. Telecom Churn 4. Social Media 5. Stock Market 6. Financial account 7. Recommendation Systems
  • 4.
    Analysis & Analytics Analyticsis a process of inspecting, cleaning, transforming, and modelling big data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis is the process by which data becomes understanding, knowledge and insight.
  • 5.
    Analysis – Knowledgeexpert, skilled person, domain knowledge required to do decision making. Analytics – Naive user – Automating the decision making process. Connection to data mining –Analytics include both data analysis (mining) and communication (guide to decision making) –Analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology •Analytics should act like an Extra Brain / Extra Eye / Extra ear / Extra Sensor like Sixth sense to an Organization. •It’s an Visualization Tool (Dashboard) Differences b/n Analytics and Analysis
  • 6.
  • 7.
    Data Data - Numeric,Character, Integer, Real, Rational, Discrete, continuous, Binary, Interval Variable, Scaled , ordinal, rational Catagorical …. Data – Univariate, Bivariate, Multi Variate , -- Recent Data Trends - RFID Data, Web Term Data, Sensor Array Data, Gene Expression Data, Consumer Preference Data, Symbols, Social Media data etc. Emoticons - Smiley, Angry Data - are encodings that represent the qualitative or quantitative attributes of a variable or set of variables. Data is comprised of facts and statistics collected together for reference or analysis.
  • 8.
    Viewing the Data Data– Object type Array List Table Matrix Vector Data Frame One Dimensional Two Dimensional Multi-Dimensional
  • 9.
    Multi-Dimensional Data asThree-Field Table versus Two-Dimensional Matrix
  • 10.
    Multi-Dimensional Data asThree-Field Table versus Two-Dimensional Matrix
  • 11.
    Multi-Dimensional Data asFour-Field Table versus Three-Dimensional Cube
  • 12.
    Multi-Dimensional Data asFour-Field Table versus Three-Dimensional Cube
  • 13.
    BIG DATA Big Datarefers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze. Big Data refers to data sets grow so large and complex that it is difficult to capture, store, manage, share, analyze and visualize with current computational architecture. Goals: To discover new opportunities, measure efficiencies uncover relationships
  • 14.
    Big Data •1. TheData increases continuously •Ex: Angry Birds – Mobile App •( After downloading from millions of people – back end database – users, levels, scores, functionality, speed etc.) •2.Structure / Unstructure •Example for Dbms - Excel sheet • RDBMS – DB2, Informix etc •Data type for faceboook – emails, smileys, xml data, audio video
  • 15.
    Big Data •3. Difficultto Analyze (Blue Ray disc – 2 GB Movie - 2 hrs time to watch and analyze 1 PB -- 10 years Facebook – Generates 300 PB every month. Youtube – CCTV – videos etc) 4. With in certain tolerable time limit (Facebook, Youtube etc usage on windows, MacOS, unix platform.. Can you visualize how many users are working on windows, MacOS, unix platform from last 2 months in one hour )
  • 16.
  • 17.
    Data Science refersto gain insights into data through computation, statistics, and visualization. A Data Scientist Is... someone who knows more statistics than a computer scientist and more computer science than a statistician.” - Josh Blumenstock “Data Scientist = Statistician + Programmer + Coach + Storyteller + Artist”. - Shlomo Aragmon
  • 18.
    ● Quantitative skill: suchas mathematics or statistics ● Technical aptitude: namely, software engineering, machine learning, and programming skills ● Skeptical mind-set and critical thinking: It is important that data scientists can examine their work critically rather than in a one-sided way. ● Curious and creative: Data scientists are passionate about data and finding creative ways to solve problems and portray information. ● Communicative and collaborative: Data scientists must be able to articulate the business value in a clear way and collaboratively work with other groups, including project sponsors and key stakeholders. Data Scientist - Characteristics
  • 19.
Example of Big Data Analytics Problem
After analyzing consumer purchasing behavior, Target's statisticians determined that the retailer made a great deal of money from three main life-event situations:
● Marriage, when people tend to buy many new products
● Divorce, when people buy new products and change their spending habits
● Pregnancy, when people have many new things to buy and have an urgency to buy them
  • 21.
Why We Are Giving Importance to Business Analytics
  • 22.
Evolution of Business Analytics – started with exchange of goods, then goods selling.
• Product View (19th Century): Suppliers & Customers
• Managerial View (20th Century): Suppliers, Customers, Owners & Employees
• Business Intelligence View (1960s to 1990s): Suppliers, Customers, Owners, Employees, Competitors, Government & Environmental view
• Next-Generation Business Intelligence View (current – ANALYTICS view): Suppliers, Customers, Owners, Employees, Competitors, Government, Environment, Online communities, news, media, International partners & Multinational companies
  • 23.
Some common types of decisions that can be enhanced by using analytics include:
• Pricing (for example, setting prices for consumer and industrial goods, government contracts, and maintenance contracts)
• Customer segmentation (for example, identifying and targeting key customer groups in the retail, insurance, and credit card industries)
ADVANTAGES:
 BA increases profitability and shareholder returns
 BA enhances understanding of data
 BA is vital for businesses to remain competitive
 BA enables creation of informative reports
  • 24.
Four Types of Data Based on Measurement Scale:  Categorical (nominal) data  Ordinal data  Interval data  Ratio data — Data Availability
  • 25.
Example 1.3: Classifying Data Elements in a Purchasing Database – Data for Business Analytics, Figure 1.2
  • 26.
Example 1.3 (continued): Classifying Data Elements in a Purchasing Database – Data for Business Analytics, Figure 1.2
  • 27.
Categorical (nominal) Data  Data placed in categories according to a specified characteristic  Categories bear no quantitative relationship to one another  Examples: customer's location (America, Europe, Asia); employee classification (manager, supervisor, associate)
  • 28.
Ordinal Data  Data that is ranked or ordered according to some relationship with one another  No fixed units of measurement  Examples: college football rankings; survey responses (poor, average, good, very good, excellent)
  • 29.
Interval Data  Ordinal data but with constant differences between observations  No true zero point  Ratios are not meaningful  Examples: temperature readings; SAT scores
  • 30.
Ratio Data  Continuous values with a natural zero point  Ratios are meaningful  Examples: monthly sales; delivery times
  • 31.
Variable Types – Identify the correct type(s) (Nominal / Ordinal / Continuous):
• Quality of life: 1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good
• Ethnicity: 1 = Non-Hispanic, 2 = Hispanic
• Race: 1 = African American, 2 = Caucasian, 3 = Other
• Diabetes: 1 = Absent, 2 = Present
• Systolic BP: ranges from 95 to 190 mmHg
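The nominal/ordinal/continuous distinction above maps directly onto R's data types; a minimal sketch (the variable names mirror the exercise above):

```r
# Nominal: categories with no inherent order
ethnicity <- factor(c("Non-Hispanic", "Hispanic", "Non-Hispanic"))

# Ordinal: ordered categories (quality-of-life scale)
qol <- factor(c("Poor", "Good", "Average"),
              levels = c("Poor", "Fair", "Average", "Good", "Very Good"),
              ordered = TRUE)

# Continuous: systolic blood pressure in mmHg
sbp <- c(120, 135, 95)

is.ordered(qol)     # TRUE: rank comparisons are allowed for ordinal data
qol[2] > qol[1]     # TRUE: "Good" ranks above "Poor"
mean(sbp)           # arithmetic is meaningful only for numeric data
```

Declaring a variable as an ordered factor tells R which summaries and models are legitimate for it, which is exactly the point of the measurement-scale classification.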
  • 32.
Variable Types – Identify the correct type(s) (Nominal / Ordinal / Continuous):
● Quality of life: 1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good
● Ethnicity: 1 = Tribal, 2 = Religious
● Race: 1 = Red Indian, 2 = Block Indian, 3 = Other
● Diabetes: 1 = Absent, 2 = Present
● Systolic BP: ranges from 95 to 190 mmHg
  • 34.
  • 35.
Web References
 https://cran.r-project.org/ (For Software – R)
 https://www.rstudio.com/ (For RStudio – GUI)
 http://www.r-tutor.com/ (For Basic Learners)
 https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial (For Scripts)
 https://www.analyticsvidhya.com/ (For Advanced level – Mining, Machine Learning...)
  • 36.
General Programming Components – Program structure, syntax, semantics, execution; comments, data storage elements, data types, I/O statements, control statements, arrays, structures, functions, files etc.
R Packages – specialized functions, their usage and importance
  • 37.
Implementation in R – R Programming PPT: Descriptive statistics, Predictive statistics, Visualization
  • 38.
DATA VISUALIZATION IN R
Basic Visualization: Histogram, Bar / Line Chart, Box plot, Scatter plot
Advanced Visualization: Heat Map, Mosaic Map, Map Visualization, 3D Graphs, Correlogram
demo(graphics) – for a demo in the R tool
Libraries – ggplot2, RColorBrewer; Package: HistData
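Each basic chart type listed above is a one-line call in base R; a minimal sketch on simulated data (ggplot2 offers the same charts with a layered grammar):

```r
set.seed(1)
x <- rnorm(100)                        # 100 simulated values for demonstration
y <- 2 * x + rnorm(100, sd = 0.5)      # a variable linearly related to x

hist(x, main = "Histogram")            # distribution of one variable
barplot(table(cut(x, 4)))              # bar chart of binned counts
boxplot(x, main = "Box plot")          # median, quartiles, outliers
plot(x, y, main = "Scatter plot")      # relationship between two variables
lines(lowess(x, y), col = "red")       # smoothed trend line over the scatter
```

In a script these plots go to the active graphics device (e.g. Rplots.pdf); in RStudio they appear in the Plots pane.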
  • 39.
Five Steps of a Machine Learning Project Lifecycle  1. Obtain Data  2. Scrub Data  3. Explore Data  4. Model Data  5. Interpret Data — Project Development Life Cycle
  • 40.
Five Steps of a Machine Learning Project Lifecycle  1. Obtain Data – data collection, sources  2. Scrub Data – data cleaning  3. Explore Data – descriptive statistics  4. Model Data – model fitting  5. Interpret Data – results
  • 41.
Project Life Cycle – Project Development Life Cycle
  • 42.
1. Obtain Data
Data collection: querying from a database, flat files, Excel data
Unstructured – NoSQL, MongoDB
Social media data – Facebook, WhatsApp, Twitter
Website data – web scraping, Beautiful Soup, Web APIs, IoT sensor data
Data sets from websites like Kaggle, KDnuggets, ISRO, NRSC, Bhuvan...
Formats – CSV, TSV, special parser formats
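Reading the flat-file formats listed above is a single call in base R; a minimal sketch using an inline CSV (in practice the text would come from a downloaded file, e.g. a Kaggle data set):

```r
# A small inline CSV standing in for a real file
csv_text <- "id,region,sales
1,Asia,120
2,Europe,95
3,America,143"

sales <- read.csv(text = csv_text)  # read.csv("file.csv") for a file on disk;
                                    # read.delim() handles TSV
str(sales)                          # always inspect column names and types first
nrow(sales)                         # 3 rows loaded
```

The same data frame interface is the target for the other sources too: database queries (DBI), web APIs (httr/jsonlite) and scraped pages (rvest) all typically end up as a data frame.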
  • 43.
Sensors which generate data
  • 44.
Inputs are digitized and placed onto the network
  • 45.
  • 46.
2. Scrub Data (Data Cleaning, Pre-processing)
Garbage in -> garbage out; no quality data -> no quality results.
Data consolidation, data cleaning, handling missing values and outliers, dimensionality reduction, smoothing, normalization.
Text Cleaning Techniques:
• Make all text lower case
• Removing punctuation
• Removal of stop words
• Removing URLs
• Removing HTML tags
• Removing emojis, emoticons
Image Pre-processing:
• Noise removal
• Image filtering
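The text-cleaning steps listed above chain together as a few gsub() calls in base R; a minimal sketch on one made-up tweet:

```r
tweet <- "Loving #RStats!! Check https://example.com <b>now</b> :)"

clean <- tolower(tweet)                      # 1. make all text lower case
clean <- gsub("https?://\\S+", "", clean)    # 2. remove URLs
clean <- gsub("<[^>]+>", "", clean)          # 3. strip HTML tags
clean <- gsub("[[:punct:]]", "", clean)      # 4. remove punctuation (also drops
                                             #    ASCII emoticons like ":)")
clean <- gsub("\\s+", " ", trimws(clean))    # 5. collapse extra whitespace
clean                                        # "loving rstats check now"
```

Stop-word removal is the one step that needs a word list; the tm package mentioned later provides removeWords() and stopwords() for that.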
  • 47.
Feature / variable creation is a process to generate new variables / features based on existing variable(s). (Rank of matrix – number of featured attributes.)
Handling Missing Data:
• Deletion – deleting rows (listwise deletion), pairwise deletion, deleting columns
• Imputation
 – General problem:
  · Categorical – make NA a level, multiple imputation, logistic regression
  · Continuous – mean, median, mode, multiple imputation, linear regression
 – Time-series problem:
  · Data without trend & without seasonality – mean, median, mode, random sample imputation
  · Data with trend & without seasonality – linear interpolation
  · Data with trend & with seasonality – seasonal adjustment + interpolation
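The mean/median imputation branch above for a continuous variable is a one-liner in base R; a minimal sketch (the variable name is illustrative):

```r
age <- c(23, NA, 31, 27, NA, 45)   # toy column with two missing values

# Mean imputation: replace NAs with the mean of the observed values
age_mean <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)

# Median imputation: more robust when the column has outliers
age_med <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

sum(is.na(age_mean))   # 0 - no missing values remain
```

Both methods shrink the variance of the column, which is why multiple imputation or model-based imputation is preferred for serious analyses.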
  • 48.
3. Explore Data
Inspect the data and its properties. Different data types – numerical, categorical, ordinal and nominal – require different treatments.
Descriptive analysis, understanding of the data. Apply descriptive statistics to understand the data. Significance tests, distributions, inferential statistics.
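A first pass of the descriptive statistics mentioned above, sketched on simulated income data (values are synthetic):

```r
set.seed(7)
income <- round(rnorm(50, mean = 50000, sd = 8000))  # 50 simulated incomes

summary(income)                  # min, quartiles, median, mean, max at a glance
mean(income); sd(income)         # centre and spread
quantile(income, c(0.25, 0.75))  # interquartile range endpoints
table(cut(income, 5))            # frequency table over 5 equal-width bins
```

summary() also works on whole data frames, giving per-column statistics for numeric columns and counts for factors, which makes it the usual first command after loading data.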
  • 49.
Types of Analytics
  • 50.
  • 51.
4. Model Data
  • 52.
Decision Models
 Predictive decision models often incorporate uncertainty to help managers analyze risk.
 They aim to predict what will happen in the future.
 Uncertainty is imperfect knowledge of what will happen in the future.
 Risk is associated with the consequences of what actually happens.
  • 53.
Machine Learning Types
• Supervised Learning (target variable available)
 – Continuous target variable → Regression
 – Categorical target variable → Classification
• Unsupervised Learning (target variable not available) → Clustering, Association
• Semi-supervised Learning (categorical target variable) → Classification, Clustering
Machine Learning Algorithms...
  • 54.
Machine Learning Algorithms...
Predictor variable: continuous
 • Continuous response – Linear regression, Neural network, K-nearest neighbor (KNN)
 • Categorical response – Logistic regression, KNN, Neural network
 • No response – Cluster analysis, Principal Component Analysis
Predictor variable: categorical
 • Continuous response – Linear regression, Neural network
 • Categorical response – Decision/classification trees, Logistic regression, Naïve Bayes
 • No response – Association rules
  • 55.
PREDICTION TECHNIQUES – Regression Models
Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. The outcome variable is called the dependent variable; the additional variables are called independent or input variables.
Linear regression model – (sum of squared errors / least squares). Ex: What is a person's expected income?
Logistic regression model – maximum likelihood. Ex: What is the probability that an applicant will default on a loan?
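Both regression flavours above fit with one call each in base R; a hedged sketch on synthetic data (the variable names and the income/default setup are illustrative, echoing the two examples above):

```r
set.seed(42)
experience <- runif(200, 0, 30)                          # years of experience
income <- 20000 + 1500 * experience + rnorm(200, sd = 4000)

# Linear regression: least-squares fit of income on experience
fit_lm <- lm(income ~ experience)
coef(fit_lm)            # intercept near 20000, slope near 1500

# Logistic regression: maximum-likelihood fit for a binary outcome
default <- rbinom(200, 1, plogis(-2 + 0.1 * experience)) # toy default flag
fit_glm <- glm(default ~ experience, family = binomial)
predict(fit_glm, data.frame(experience = 10),
        type = "response")                               # P(default | 10 yrs)
```

summary(fit_lm) and summary(fit_glm) add standard errors and significance tests for each coefficient.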
  • 56.
PREDICTION TECHNIQUES – Regression Models
ARMA – Autoregressive Moving Average models
ARIMA – Autoregressive Integrated Moving Average models
Holt-Winters
Fuzzy logic, Neural networks
Genetic algorithms
LASSO – Least Absolute Shrinkage and Selection Operator
  • 57.
DEEP LEARNING
Deep learning is an artificial intelligence function that imitates the workings of the human brain in processing data and creating patterns for use in decision making. Deep learning is a subset of machine learning in artificial intelligence (AI) with networks capable of learning unsupervised from data that is unstructured or unlabeled. Also known as deep neural learning or deep neural networks.
1. ANN 2. CNN 3. RNN
  • 58.
5. Interpret Data
Interpreting data refers to presenting your results to a non-technical layman. Actionable insight is a key outcome of a data science project. You need strong business domain knowledge to present your findings in a way that answers the business questions you set out to answer, and to translate them into actionable steps.
Data visualization – R: ggplot2; Python: matplotlib, ggplot, seaborn
  • 59.
INSIGHTS OF DATA SCIENCE
Insight is the value obtained through the use of analytics. The insights gained through analytics are incredibly powerful and can be used to grow your business while identifying areas of opportunity. Insight is what is learned and what will improve your business.
The true power of insights-driven marketing: the real value of data and analytics lies in their ability to deliver rich insights. The best insights are actionable and prescriptive – they can be used to take immediate action that will improve your business and will inform your future path.
  • 60.
INSIGHTS OF DATA SCIENCE
2020–2050 – a data-centric world. Statisticians and data scientists will rule the world. All real-world projects are related to data, and several data science projects are coming. Industry is recruiting data-science personnel as Statistician, Business Analyst, Data Analyst, Data Engineer, Data Architect, Information Architect, Data Scientist, etc.
Data Science has started as a new branch: B.Tech (Data Science), MS/M.Tech (Data Science), PhD (Data Science). Data Science has evolved as a separate field in science; industry and academia have accepted it and started it as a new branch.
  • 61.
Sample Works
1. Forecasting on $ rate
2. Forecasting on ionospheric data (R commands for plotting, coloring, graph labeling)
3. Weather forecasting
  • 62.
Social Media Analytics
1. Twitter Analytics 2. Facebook Analytics 3. Google Analytics 4. Web Analytics Tools
Tracking and reporting social media analytics used to be a hurdle for digital marketers – now the problem is finding the ideal tool.
  • 63.
Twitter Analytics using R
Public hashtags: #namo, #kejriwal, #ipl, #climate
Packages used: twitteR – provides access to Twitter data; tm – provides functions for text processing; wordcloud – visualizes the results as a word cloud.
1. Extract tweets from Twitter. 2. Text processing 3. Apply analytical methods 4. Visualization
  • 64.
https://developer.twitter.com/ – obtain the Consumer Key and Consumer Secret, plus the Access Token key and secret, for authentication. Text Processing:
  • 65.
R packages used in Social Media Lab (planned)
• twitteR (for collecting Twitter data)
• tm (text mining)
• wordcloud (text word clouds)
• RTextTools (machine learning package for automatic text classification)
• igraph (network analysis and visualization)
• RCurl (collecting WWW data)
• XML (reading and creating XML documents)
• R.utils (programming utilities)
• ape and dendextend (dendrograms, hierarchical clustering)
• FactoMineR and homals (multiple correspondence analysis)
• plyr and stringr (text sentiment analysis)
  • 66.
  • 67.
Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to classify text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, modeled as Dirichlet distributions.
  • 68.
  • 72.
Time Series Analysis
Time series analysis plays a major role in business analytics. Time series data can be defined as the values taken by a variable traced over periods such as months, quarters or years. For example, in the share market, the price of a share changes every second. Another example of time series data is measuring the level of unemployment each month of the year.
  • 73.
Time Series Analysis
Univariate and multivariate are two types of time series data. When time series data uses a single quantity for describing values, it is termed univariate; when it uses more than a single quantity, it is called multivariate.
R provides many commands for data visualization, such as plot(), hist(), pie(), boxplot(), stripchart(), curve(), abline(), qqnorm(), etc. The plot() and hist() commands are the most used in time series analysis.
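Creating and plotting a univariate time series in base R; a minimal sketch with simulated monthly data (the unemployment framing is illustrative, echoing the example above):

```r
set.seed(3)
# Simulated monthly series: trend + yearly seasonality + noise
raw <- 5 + 0.02 * (1:48) + sin(2 * pi * (1:48) / 12) + rnorm(48, sd = 0.2)

unemp <- ts(raw, start = c(2020, 1), frequency = 12)  # monthly ts object

is.ts(unemp)            # TRUE: the object carries its own time axis
frequency(unemp)        # 12 observations per year
window(unemp, start = c(2021, 1), end = c(2021, 6))   # subset by calendar time
plot(unemp, main = "Simulated monthly series")        # time on the x-axis
```

Because the ts object stores start date and frequency, plot() labels the x-axis in years automatically, and seasonal functions like decompose() know the period.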
  • 74.
Time Series Analysis
Linear Filtering of Time Series: A simple component analysis divides the data into four main components called trend, seasonal, cyclical and irregular. Each component has its own special feature: the trend, seasonal, cyclical and irregular components describe the long-term progression, the seasonal variation, the repeated but non-periodic fluctuations, and the random fluctuations of a time series, respectively.
  • 75.
Time Series Analysis
Decomposing Time Series Data: Decomposing time series data is also a part of the simple component analysis that defines four components, viz., trend, seasonal, cyclical and irregular.
Forecasts Using Exponential Smoothing: Forecasts are a type of prediction that predicts future events from past data. Here, the forecast process uses exponential smoothing for making predictions. An exponential smoothing method tracks the changes in time series data by smoothing out the irrelevant fluctuations and makes short-term forecasts for the series.
  • 76.
Time Series Analysis
1. What do you mean by exponential smoothing?
Ans: An exponential smoothing method tracks the changes in time series data by smoothing out the irrelevant fluctuations and makes short-term forecasts for the series.
2. What is the HoltWinters() function?
Ans: The HoltWinters() function is an inbuilt function commonly used for exponential smoothing. All three types of exponential smoothing use the HoltWinters() function, but with different parameters. The function estimates values between 0 and 1 for the three smoothing parameters, viz., alpha, beta and gamma.
3. What is the function of Holt's exponential smoothing?
Ans: Holt's exponential smoothing estimates the level and slope at the current time point. The alpha and beta parameters of the HoltWinters() function control the estimates of the level and slope, respectively.
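The three exponential-smoothing variants described above all come from HoltWinters() with different parameters switched on; a minimal sketch using two built-in datasets (nhtemp is yearly and non-seasonal, AirPassengers is monthly and seasonal):

```r
# Simple exponential smoothing: level only (beta and gamma disabled)
ses <- HoltWinters(nhtemp, beta = FALSE, gamma = FALSE)

# Holt's exponential smoothing: level + slope (gamma disabled)
holt <- HoltWinters(nhtemp, gamma = FALSE)

# Full Holt-Winters: level + slope + seasonal, needs a seasonal series
hw <- HoltWinters(AirPassengers)

ses$alpha                    # estimated smoothing weight, between 0 and 1
predict(hw, n.ahead = 12)    # 12-month-ahead forecast
```

The closer alpha is to 1, the more weight recent observations get; a small alpha smooths heavily over the past.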
  • 77.
Time Series Analysis – Summary
Time series data is data stored at regular intervals. In statistics, a time series is a sequence of numerical data points in successive order.
• A multivariate time series is a type of time series that uses more than a single quantity for describing the values.
• plot(), hist(), boxplot(), pie(), abline(), qqnorm(), stripchart() and curve() are some R commands used for visualisation of time series data.
• mean(), sd(), log(), diff(), pnorm() and qnorm() are some R commands used for manipulation of time series data.
  • 78.
Time Series Analysis – Summary
• The simple component analysis divides the data into four main components named trend, seasonal, cyclical and irregular. The linear filter is a part of simple component analysis.
• Linear filtering of time series uses linear filters to generate the different components of the time series. Time series analysis mostly uses a moving average as the linear filter.
• The filter() function performs linear filtering of time series data and returns the filtered series.
• The scan() function reads data from a file. Since time series data consists of observations at successive time intervals, scan() is a convenient function for reading it.
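The moving-average filter mentioned above is stats::filter() with equal weights; a minimal sketch:

```r
x <- ts(c(2, 4, 6, 8, 10, 12, 14))

# 3-point moving average: equal weights of 1/3, centred on each observation
ma3 <- stats::filter(x, rep(1/3, 3), sides = 2)
ma3   # NA at both ends, where the 3-point window is incomplete
```

For this linear series the filter reproduces the interior points exactly (e.g. the second value is (2+4+6)/3 = 4); on noisy data the same call smooths out the irregular component, which is how it extracts the trend.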
  • 79.
Time Series Analysis – Summary
• The ts() function stores time series data and creates a time series object.
• The as.ts() function converts a simple object into a time series object, and the is.ts() function checks whether an object is a time series object or not.
• Plotting represents time series data graphically during time series analysis, and R provides the plot() function for plotting time series data.
• A non-seasonal time series contains only the trend and irregular components; hence, the decomposition process splits non-seasonal data into these components.
  • 80.
Time Series Analysis – Summary
• The SMA() function is used for decomposition of non-seasonal time series. It smooths time series data by calculating the moving average and estimates the trend and irregular components. The function is available in the "TTR" package.
• A seasonal time series contains the seasonal, trend and irregular components; hence, the decomposition process splits seasonal data into these three components.
• The seas() function automatically finds the seasonally adjusted series. The function is available in the "seasonal" package.
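The seasonal split described above can be demonstrated without installing TTR or seasonal: base R's decompose() performs the same seasonal/trend/irregular separation for a seasonal series. A minimal sketch on the built-in AirPassengers data:

```r
dec <- decompose(AirPassengers)   # additive: x = seasonal + trend + random

names(dec)   # includes "seasonal", "trend", "random"
plot(dec)    # observed series and its three components, stacked

# The components reconstruct the series exactly wherever the trend
# (a centred moving average) is defined, i.e. away from the series ends:
recon <- dec$seasonal + dec$trend + dec$random
all.equal(window(recon, 1950, c(1958, 12)),
          window(AirPassengers, 1950, c(1958, 12)))   # TRUE
```

For a series with growing seasonal swings like AirPassengers, decompose(AirPassengers, type = "multiplicative") is usually the better model.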
  • 81.
Time Series Analysis – Summary
• Regression analysis defines the linear relationship between independent variables (predictors) and the dependent (response) variable using a linear function.
• Forecasts are a type of prediction that predicts future events from past data.
• Simple exponential smoothing estimates the level at the current time point and performs short-term forecasts. The alpha parameter of the HoltWinters() function controls simple exponential smoothing.
• Holt's exponential smoothing estimates the level and slope at the current time point. The alpha and beta parameters of the HoltWinters() function control the estimates of the level and slope, respectively.
  • 82.
Time Series Analysis – Summary
• Holt-Winters exponential smoothing estimates the level, slope and seasonal component at the current time point. The alpha, beta and gamma parameters of the HoltWinters() function control the estimates of the level, slope of the trend component and seasonal component, respectively.
• ARIMA (Autoregressive Integrated Moving Average) is another method of time series forecasting.
• The ARIMA model explicitly models the irregular component of a stationary time series with non-zero autocorrelation. It is represented as ARIMA(p, d, q), where the parameters p, d and q define the autoregression (AR) order, the degree of differencing and the moving average (MA) order, respectively.
  • 83.
Time Series Analysis – Summary
• The diff() function differences a time series, which helps to obtain a stationary series for an ARIMA model and to find the value of the 'd' parameter of an ARIMA(p, d, q) model.
• The auto.arima() function automatically returns the best candidate ARIMA(p, d, q) model for a given time series.
• The arima() function estimates the parameters of the selected ARIMA(p, d, q) model.
• The arima.sim() function simulates time series values from a specified ARIMA model.
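The diff()/arima() workflow summarised above, sketched in base R on the built-in AirPassengers data (auto.arima() lives in the separate forecast package, so this sketch fixes the classic "airline model" orders by hand):

```r
y <- log(AirPassengers)   # log stabilises the growing seasonal variance
dy <- diff(y)             # first difference removes the trend => d = 1

acf(dy)                   # autocorrelation plot guides the choice of p and q

# The classic airline model: ARIMA(0,1,1) with a seasonal (0,1,1) at lag 12
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit$coef                              # estimated MA and seasonal MA terms
exp(predict(fit, n.ahead = 12)$pred)  # 12-month forecast, back on the
                                      # original passenger scale
```

auto.arima(y) from the forecast package would search over (p, d, q) orders automatically and typically lands on a model close to this one for this series.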
  • 84.
 IBM Watson Analytics tool  developer.twitter.com  developer.google.com  Developer.ibm  APIs  Amazon Web Services
  • 85.
  • 86.
SELF ANALYTICS
• Stay Home, Stay Safe
• Do yoga, stay healthy (physical fitness) (hand movements, neck exercises, eye exercises, Surya Namaskar...)
• Do breathing exercises – they improve your concentration power
• Maintain social distance from electronic devices for at least 2 hrs per day (1 hour before going to bed and 1 hour after waking up). After using a mobile or computer, practice the Gandhari exercise.
• Practice meditation – move towards spirituality.
  • 87.
The purpose of LIFE is "Awakening within You and Awakening the Others". "Life Long Loyal Love for Learning"