Big data is growing exponentially, with an estimated 2.5 quintillion bytes of data generated daily, yet 85% of organizations are expected to be unable to exploit big data for competitive advantage. Data analytics extracts useful patterns and decision models from preprocessed data using techniques such as statistics, machine learning, and neural networks. Properly standardizing and categorizing data is important for collaborative analysis and insight. Data segmentation divides data into subsets for more efficient modeling and marketing. Statistical tests such as the F-test, together with the t-distribution, help analyze differences between data sets and determine confidence levels.
2. Introduction to Big Data
• Data are everywhere.
• IBM estimates that 2.5 quintillion bytes of data are generated every day.
• 90 percent of the world's data has been created in the last two years.
• 85 percent of organizations will be unable to exploit big data for competitive advantage.
• 4.4 million jobs will be created around big data.
3. Largest Data Sets Analysis by KDnuggets
Data Size (respondents)    Percentage
Less than 1 MB (12)        3.3
1.1 to 10 MB (8)           2.5
11 to 100 MB (14)          4.3
101 MB to 1 GB (50)        15.5
1.1 to 10 GB (59)          18
11 to 100 GB (52)          16
101 GB to 1 TB (59)        18
1.1 to 10 TB (39)          12
11 to 100 TB (15)          4.7
101 TB to 1 PB (6)         1.9
1.1 to 10 PB (2)           0.6
11 to 100 PB (0)           0
Over 100 PB (6)            1.9
6. ANALYTICS
• Analytics is a term often used interchangeably with data science, data mining, and knowledge discovery.
• It refers to extracting useful business patterns or mathematical decision models from a preprocessed data set.
• Different underlying techniques can be used for this purpose:
• Statistics (linear and logistic regression)
• Machine learning (decision trees)
• Biology-inspired methods (neural networks)
• Kernel methods (SVMs)
7. Predictive and Descriptive - Distinction
• Predictive
• A target variable is available.
• The target is categorical or continuous.
• Descriptive
• No target variable is available.
• Examples: association rules, sequence rules, and clustering.
8. Analytical Model Requirements
• A first critical success factor is business relevance.
• The analytical model should actually solve the business problem for which it was developed.
• It makes no sense to have a working analytical model that got sidetracked from the original problem statement.
• To achieve business relevance, the business problem to be solved must be appropriately defined, qualified, and agreed upon by all parties involved at the outset of the analysis.
9. Analytical Model Requirements
• A second criterion is statistical performance.
• The model should have statistical significance and predictive power.
• Depending upon the application, analytical models should also be:
• Interpretable: the patterns the analytical model captures can be understood.
• Justifiable: the degree to which the model corresponds to previous business knowledge.
10. Analytical Model Requirements
• Analytical models should also be operationally efficient. This covers:
• the effort needed to collect and preprocess the data,
• the effort needed to evaluate the model,
• the effort needed to feed its outputs to the business application,
• the economic cost needed to set up the analytical model.
• Analytical models should also comply with both local and international regulation.
11. STANDARDIZING AND CATEGORIZING
• Data standardization is the process of converting data to a common format so that users can process and analyze it.
• It is a critical process: bringing data into a common format allows for
• collaborative research,
• large-scale analytics,
• sharing of sophisticated tools and methodologies.
12. Steps to Standardize Data
• Four steps to standardize customer data for better insights:
• Step 1: Conduct a data source audit.
• Step 2: Define standards for data formats.
• Step 3: Standardize the format of external data sources.
• Step 4: Standardize existing data in the database.
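As a minimal sketch of Steps 2 and 3 above, the snippet below normalizes dates arriving in mixed formats to one agreed format. The format list and function name are illustrative assumptions, not part of the slides:

```python
from datetime import datetime

# Illustrative assumption: incoming records mix these three date formats.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

def standardize_date(value: str) -> str:
    """Convert a date in any known source format to the common ISO format."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"Unrecognized date format: {value!r}")

print(standardize_date("25/12/2023"))  # -> 2023-12-25
```

The same pattern (a list of accepted source formats mapped onto one target format) applies to phone numbers, country codes, and other fields being standardized.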
14. CATEGORIZATION
• Categorization is a major component of qualitative data analysis, by which investigators attempt to group patterns observed in the data into meaningful units or categories.
• Categorization is also referred to as coarse classification, classing, grouping, or binning.
• For categorical variables, it is often necessary to reduce the number of categories.
• E.g., the purpose-of-loan variable may have 50 distinct values.
• 49 dummy variables would then be needed to encode this one variable in a regression model.
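The dummy-variable count can be seen in a small sketch (the loan-purpose values here are made up for illustration): a variable with k categories needs only k - 1 dummies, because the reference category is implied when all dummies are zero.

```python
# Hypothetical loan-purpose categories; the first acts as the reference level.
purposes = ["car", "house", "education", "business"]

def dummy_encode(value, categories):
    # k categories -> k - 1 dummy indicators (reference category dropped)
    return [1 if value == c else 0 for c in categories[1:]]

print(dummy_encode("education", purposes))  # -> [0, 1, 0]
print(dummy_encode("car", purposes))        # -> [0, 0, 0] (reference level)
```

With 50 loan purposes instead of 4, each observation would need 49 such indicators, which is why categorization reduces the categories first.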
15. Categorization Methods
• Two very basic methods are used for categorization:
• equal interval binning
• equal frequency binning
• Consider, for example, the income values 1,000, 1,200, 1,300, 2,000, 1,800, and 1,400.
• Equal interval binning would create two bins spanning ranges of the same width:
• Bin 1: [1,000, 1,500]
• Bin 2: [1,500, 2,000]
• Equal frequency binning would create two bins with the same number of observations:
• Bin 1: 1,000, 1,200, 1,300
• Bin 2: 1,400, 1,800, 2,000
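The two schemes can be reproduced on the slide's income values; this is a plain-Python sketch rather than a library call:

```python
incomes = [1000, 1200, 1300, 2000, 1800, 1400]
n_bins = 2

# Equal interval binning: split the value range into bins of equal width.
lo, hi = min(incomes), max(incomes)
width = (hi - lo) / n_bins  # 500
interval_bins = [[] for _ in range(n_bins)]
for x in incomes:
    idx = min(int((x - lo) // width), n_bins - 1)  # clamp the maximum value
    interval_bins[idx].append(x)

# Equal frequency binning: sort, then give each bin the same number of points.
ordered = sorted(incomes)
size = len(ordered) // n_bins
freq_bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

print(interval_bins)  # -> [[1000, 1200, 1300, 1400], [2000, 1800]]
print(freq_bins)      # -> [[1000, 1200, 1300], [1400, 1800, 2000]]
```

Note the two methods disagree on where 1,400 goes: it falls in the lower interval by range, but in the upper bin by rank.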
16. Weight of Evidence Coding
• A variable transformation applied to independent variables.
• Used for grouping, variable selection, etc.
• The weight of evidence measures the predictive power of an independent variable in relation to the dependent variable.
17. Weight of Evidence Coding
• Example: predict whether a customer is good or bad based on age or income.
• Model 1:
• Customer type = a + b (income) ----> predicts 70% correctly
• Model 2:
• Customer type = a + b (age) ----> predicts 60% correctly
• So the ability of income to separate good and bad customers is greater than that of age, and hence income carries more weight of evidence.
18. Weight of Evidence Coding
• Definition:
• Since the technique evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers.
• "Bad customers" refers to customers who defaulted on a loan, and "good customers" refers to customers who paid back their loan.
• For a given category, WOE = ln(Distribution of Goods / Distribution of Bads).
• Positive WOE means Distribution of Goods > Distribution of Bads.
• Negative WOE means Distribution of Goods < Distribution of Bads.
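The definition can be computed directly. The age bins and good/bad counts below are invented purely for illustration:

```python
from math import log

# Invented counts: (goods, bads) observed in each category of a binned variable.
bins = {"age < 30": (100, 50), "age >= 30": (400, 50)}
total_goods = sum(g for g, _ in bins.values())  # 500
total_bads = sum(b for _, b in bins.values())   # 100

woe = {}
for name, (g, b) in bins.items():
    # WOE = ln( share of all goods in this bin / share of all bads in this bin )
    woe[name] = log((g / total_goods) / (b / total_bads))
    print(name, round(woe[name], 3))
```

The signs behave as the slide states: the younger bin holds relatively more bads (negative WOE), while the older bin holds relatively more goods (positive WOE).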
20. DATA SEGMENTATION
• Sometimes data is segmented before the analytical modeling starts.
• The segmentation can be conducted
• using the experience and knowledge of a business expert, or
• based on statistical analysis using decision trees, k-means, or self-organizing maps.
• Segmentation is used to estimate a different analytical model for each segment, personalized to that segment.
• This must be done carefully, because it can increase production, monitoring, and maintenance costs.
21. DATA SEGMENTATION
• Data segmentation is the process of taking the data you hold and dividing it up, grouping similar data together based on chosen parameters.
• This lets you use the data more efficiently within marketing and operations.
• At its simplest, it is the process of grouping your data into at least two subsets.
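As a sketch of the statistical route to segmentation, here is a toy one-dimensional k-means that splits customers into two spending segments. The spend figures are invented, and a real project would use a library such as scikit-learn:

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Toy 1-D k-means: alternate assignment and mean-update steps."""
    random.seed(seed)
    centers = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest current center
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(clusters, key=lambda c: min(c) if c else float("inf"))

spend = [10, 12, 11, 95, 100, 98, 13]  # invented monthly spend per customer
low, high = kmeans_1d(spend)
print(low, high)  # low-spend segment vs. high-spend segment
```

A separate model would then be estimated per segment, which is exactly where the extra production and maintenance cost mentioned above comes from.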
22. F-TEST
• The F-test is used to test for the equality of two population variances.
• If a researcher wants to test whether or not two independent samples have been drawn from normal populations with the same variability, the F-test is generally employed.
23. F-TEST
• It is a statistical test used to compare two different data sets.
• Alongside the test statistic, tools typically report the mean, variance, and number of observations of each data set.
• In regression, the F-test compares your model against an intercept-only model (zero predictor variables) and decides whether the added coefficients improve the fit.
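A variance-ratio sketch with invented samples: the statistic is the ratio of the two sample variances (larger over smaller here), which is then compared against an F critical value with (n1 - 1, n2 - 1) degrees of freedom from a table or a stats library.

```python
from statistics import variance

# Two invented independent samples of equal size.
sample_a = [12.0, 14.0, 15.0, 13.0, 16.0]
sample_b = [22.0, 29.0, 25.0, 31.0, 18.0]

var_a, var_b = variance(sample_a), variance(sample_b)  # 2.5 and 27.5
f_stat = max(var_a, var_b) / min(var_a, var_b)         # larger variance on top
df = (len(sample_b) - 1, len(sample_a) - 1)            # numerator df first
print(f_stat, df)  # -> 11.0 (4, 4)
```

An F of 11.0 far exceeds the 95% critical value for (4, 4) degrees of freedom, so these two samples would not be judged to share the same variability.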
24. T-distribution
• The t-distribution is used as an alternative to the normal distribution when sample sizes are small, in order to estimate confidence intervals.
• It is also used to determine critical values, i.e., how far an observation must lie from the mean to be considered unusual.
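A small-sample confidence interval sketch: with n = 5 (hence df = 4), the two-sided 95% critical value from a standard t-table is about 2.776, noticeably wider than the normal distribution's 1.96. The data values are invented.

```python
from math import sqrt
from statistics import mean, stdev

sample = [4.8, 5.1, 5.0, 4.9, 5.2]  # invented measurements, n = 5
t_crit = 2.776                       # t_{0.025, df=4} from a standard t-table

margin = t_crit * stdev(sample) / sqrt(len(sample))
lower, upper = mean(sample) - margin, mean(sample) + margin
print(round(lower, 3), round(upper, 3))  # 95% CI for the mean
```

Using the normal critical value 1.96 here would understate the interval, which is precisely why the t-distribution is preferred for small samples.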