Introduction to BIG DATA
Types, Characteristics & Benefits
In order to understand 'Big Data', we first need to know what
'data' is. Oxford dictionary defines 'data' as –
"The quantities, characters, or symbols on which operations
are performed by a computer, which may be stored and
transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media."
So, 'Big Data' is also data, but of enormous size. 'Big Data' is a term
used to describe a collection of data that is huge in volume and yet grows
exponentially with time. In short, such data is so large and complex
that none of the traditional data management tools can store
or process it efficiently.
Examples Of 'Big Data'
The New York Stock Exchange generates
about one terabyte of new trade data per
day.
Social Media Impact
Statistics show that 500+ terabytes of new data
are ingested into the databases of the social
media site Facebook every day. This data is
mainly generated through photo and video
uploads, message exchanges, comments, etc.
A single jet engine can
generate 10+ terabytes of data in 30
minutes of flight time. With many
thousands of flights per day, data
generation reaches many petabytes.
Categories Of 'Big Data'
'Big Data' can be found in three forms:
Structured
Any data that can be stored, accessed, and processed in
the form of a fixed format is termed 'structured' data.
Over time, computer science talent has achieved great
success in developing techniques for working with such
data (where the format is well known in advance) and
deriving value out of it. However, nowadays we are
foreseeing issues when the size of such data grows to a
huge extent; typical sizes are in the range of multiple
zettabytes.
Employee_ID Employee_Name Gender Department Salary_In_lacs
2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000
Examples Of Structured Data
An 'Employee' table in a database is an example of
Structured Data
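As an illustration, here is a minimal sketch (assuming Python with pandas, which is not part of the original slides) of how such a fixed-format table can be re-created and queried:

```python
import pandas as pd

# The 'Employee' table from the slide, re-created as a fixed-format DataFrame.
employees = pd.DataFrame(
    {
        "Employee_ID": [2365, 3398, 7465, 7500, 7699],
        "Employee_Name": ["Rajesh Kulkarni", "Pratibha Joshi", "Shushil Roy",
                          "Shubhojit Das", "Priya Sane"],
        "Gender": ["Male", "Female", "Male", "Male", "Female"],
        "Department": ["Finance", "Admin", "Admin", "Finance", "Finance"],
        "Salary_In_lacs": [650000, 650000, 500000, 500000, 550000],
    }
)

# Because the schema is known in advance, queries are straightforward.
finance_staff = employees[employees["Department"] == "Finance"]
print(finance_staff[["Employee_Name", "Salary_In_lacs"]])
```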
Semi-structured
Semi-structured data can contain both forms of data. We can see
semi-structured data as structured in form, but it is actually not defined with,
e.g., a table definition as in a relational DBMS. Examples of semi-structured
data are data represented in XML files, email, JSON documents, HTML, EDI, and RDF.
Examples Of Semi-structured Data
Personal data stored in a XML file-
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
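A minimal sketch of how such records could be parsed with Python's standard library (assuming the records are wrapped in a root element so they form a well-formed XML document):

```python
import xml.etree.ElementTree as ET

# The <rec> records from the slide, wrapped in a root element.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</people>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Tags act as self-describing labels, but there is no fixed relational schema.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))
```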
Unstructured
Any data with an unknown form or
structure is classified as unstructured data.
In addition to its huge size,
unstructured data poses multiple
challenges in terms of processing it to
derive value. A typical example of
unstructured data is a heterogeneous data
source containing a combination of simple
text files, images, videos, etc. Nowadays
organizations have a wealth of data available
to them, but unfortunately they don't
know how to derive value out of it, since
this data is in its raw, unstructured
form.
Examples Of Un-structured Data
Output returned by 'Google Search'
Characteristics Of 'Big Data'
Volume
The name 'Big Data' itself is related to a size which is enormous.
The size of data plays a very crucial role in determining the value of data.
Also, whether particular data can actually be considered Big Data
or not depends upon the volume of data. Hence, 'Volume' is
one characteristic which needs to be considered while dealing with
'Big Data'.
• 500 hours of video are uploaded to YouTube every minute.
Velocity
The term 'velocity' refers to the speed of generation of data. How
fast the data is generated and processed to meet demand
determines the real potential of the data.
Big Data Velocity deals with the speed at which data flows in from
sources like business processes, application logs, networks and
social media sites, sensors, Mobile devices, etc. The flow of data is
massive and continuous.
Facebook users for example upload more than 900 million photos
a day. Facebook's data warehouse stores upwards of 300
petabytes of data, but the velocity at which new data is created
should be taken into account. Facebook claims 600 terabytes of
incoming data per day.
Google alone processes on average more than "40,000 search queries
every second," which roughly translates to more than 3.5 billion searches
per day.
Variety
Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. During earlier days, spreadsheets and
databases were the only sources of data considered by most of the
applications. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. is also being considered in the
analysis applications. This variety of unstructured data poses certain
issues for storage, mining and analysing data.
Veracity
Big Data Veracity refers to the biases, noise, and abnormality in data: is the
data that is being stored and mined meaningful to the problem being
analyzed?
Veracity is the biggest challenge in data analysis when compared to things
like volume and velocity.
It covers uncertainty due to inconsistency, incompleteness, ambiguity, etc.
Validity
Validity refers to how accurate and correct the data is for its
intended use. According to Forbes, an estimated 60 percent
of a data scientist's time is spent cleansing their data before
being able to do any analysis.
Volatility
How old does your data need to be before it is considered irrelevant,
historic, or not useful any longer? How long does data need to be kept
for?
Vulnerability
In this context, vulnerability refers to any type of weakness in a computer
system itself, in a set of procedures, or in anything that leaves
information security exposed to a threat.
Visualization
You can't rely on traditional graphs when trying to plot a
billion data points, so you need different ways of
representing data such as data clustering or using tree
maps, sunbursts, parallel coordinates, circular network
diagrams, or cone trees.
Value
Last, but arguably the most important of all, is value. The
other characteristics of big data are meaningless if you
don't derive business value from the data.
Quiz Time:
1. True or False: data of byte size is called Big Data.
Data Analytics
Data analysis, also known as analysis of data or data
analytics, is a process of inspecting, cleansing, transforming,
and modeling data with the goal of discovering useful
information, suggesting conclusions, and supporting
decision-making.
Data analysis has multiple facets and approaches,
encompassing diverse techniques under a variety of
names, in different business, risk, health care, web,
science, and social science domains.
Analytics is not a tool or technology; rather, it is a way of
thinking and acting.
Types Of Analyses Every Data Scientist Should Know
Descriptive
Predictive
Prescriptive
Diagnostic
Exploratory
Inferential
Types Of Analyses Every Data Scientist Should Know
Descriptive (least amount of effort):
Descriptive analytics uses existing information from the past to
understand decisions in the present and helps decide an
effective course of action for the future.
The discipline of quantitatively describing the main features of a
collection of data. In essence, it describes a set of data.
–Typically the first kind of data analysis performed on a data set
–Commonly applied to large volumes of data, such as census data
-The description and interpretation processes are different steps
–Univariate and Bivariate are two types of statistical descriptive
analyses.
–Type of data set applied to: Census Data Set – a whole population
Descriptive Analytics: What is happening?
Insight into the past
Descriptive analysis or statistics does exactly what the name
implies: it "describes", or summarizes, raw data and makes it
something that is interpretable by humans.
It is analytics that describes the past.
It is useful because it allows us to learn from past behaviors
and understand how they might influence future outcomes.
Common examples of descriptive analytics are
• reports that provide historical insights regarding the company's
production, financials, operations, sales, inventory, and
customers; a minimal sketch of this kind of summary follows.
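As a small illustration (the figures are hypothetical, not from any real report), descriptive analytics in Python with pandas can be as simple as summarizing historical records:

```python
import pandas as pd

# Hypothetical monthly sales figures; a real report would use historical company data.
sales = pd.DataFrame(
    {
        "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
        "revenue": [120, 135, 128, 150, 160, 155],
        "units_sold": [300, 340, 320, 380, 410, 395],
    }
)

# Descriptive analytics: summarize what has already happened.
print(sales[["revenue", "units_sold"]].describe())   # count, mean, std, quartiles
print("Total revenue:", sales["revenue"].sum())
```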
Predictive Analytics: What is likely to happen?
Understanding the future
“Predict” what might happen.
These analytics are about understanding the future.
Predictive analytics provide estimates about the likelihood of a
future outcome.
The foundation of predictive analytics is based on probabilities.
A few examples are:
•understanding how sales might close at the end of the year,
•predicting what items customers will purchase together, or
•forecasting inventory levels based upon a myriad of variables.
Prescriptive Analytics: What do I need to do?
Advise on possible outcomes
Prescriptive analytics allows users to "prescribe" a number of different
possible actions and guides them towards a solution.
It predicts not only what will happen, but also why it will happen,
providing recommendations regarding actions that will take advantage
of the predictions.
Prescriptive analytics use a combination of techniques and tools
such as business rules, algorithms, machine learning and
computational modelling procedures.
For example, prescriptive analytics can optimize production, scheduling,
and inventory in the supply chain to make sure that the right products
are delivered at the right time, optimizing the customer experience.
Diagnostic Analytics
is a form of advanced analytics which examines data or
content to answer the question “Why did it happen?”, and is
characterized by techniques such as drill-down, data
discovery, data mining and correlations.
Ex: healthcare provider compares patients’ response to a
promotional campaign in different regions;
Exploratory
An approach to analyzing data sets to find previously unknown
relationships.
–Exploratory models are good for discovering new connections
–They are also useful for defining future studies/questions
–Exploratory analyses are usually not the definitive answer to the
question at hand, but only the start
–Exploratory analyses alone should not be used for generalizing
and/or predicting
Ex: Microarray studies :
aimed to find uncharacterised genes, which act at specific points
during the cell cycle
Inferential
Aims to test theories about the nature of the world in general
(or some part of it) based on samples of “subjects” taken from
the world (or some part of it). That is, use a relatively small
sample of data to say something about a bigger population.
Inference involves estimating both the quantity you care about
and your uncertainty about your estimate
– Inference depends heavily on both the population and the
sampling scheme
Type of data set applied to: Observational, Cross Sectional
Time Study, and Retrospective Data Set – the right, randomly
sampled population
Big Data Technology
EXAMPLE APPLICATIONS
The relevance, importance, and impact of analytics are now
bigger than ever before and, given that more and more data
are being collected and that there is strategic value in knowing
what is hidden in data, analytics will continue to grow.
ANALYTICS PROCESS MODEL
• Problem Definition.
• Identification of data source.
• Selection of Data.
• Data Cleaning.
• Transformation of data.
• Analytics.
• Interpretation and Evaluation.
Problem Definition.
• Problem identification and definition :The problem is
a situation that is judged as something that needs to
be corrected.
• It is the job of the analyst to make sure that the right
problem is solved.
Problem can be identified through :
• Comparative / Benchmarking studies.
Benchmarking is comparing one's business
processes and performance metrics to industry bests
and best practices from other companies.
•Performance reporting.
Assessment of present performance against goals and
objectives.
•SWOT analysis.
SWOT Analysis
Depending on the type of problem, the source data needs to be identified.
Data is the key ingredient of any analytical exercise, and the selection of data
will have a deterministic impact on the analytical models that we build.
Few Data Collection Technique:
•Using data that has already been collected by others.
•Systematically selecting and watching characteristics of people,
object and events.
•Oral questioning of respondents, either individually or as a group.
•Facilitating free discussions on specific topics with selected
group of participants.
Data Storage: All data will then be gathered in a staging area, which could be,
for example, a data mart or data warehouse.
Data Exploration / Data Cleaning
Before a formal data analysis can be conducted, the analyst must know how
many cases there are in the data set, which variables are included, how many
observations are missing, and what general problems the data is likely
to suffer from.
Analysts commonly use visualization for data exploration because it allows users
to quickly and simply view most of the relevant features of their data set.
Basic exploratory analysis can be applied here,
for example online analytical processing (OLAP) facilities for multidimensional data
analysis (e.g., roll-up, drill-down, slicing and dicing); a minimal sketch follows.
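A minimal sketch of this kind of exploration using pandas; the file path and the column names ("region", "revenue") are illustrative assumptions, not part of the original notes:

```python
import pandas as pd

# Load a data set from the staging area (file name is hypothetical).
df = pd.read_csv("staging/customer_data.csv")

# Basic exploration: how many cases, which variables, how many missing values.
print(df.shape)              # number of rows and columns
print(df.dtypes)             # variables and their types
print(df.isnull().sum())     # missing observations per variable
print(df.describe())         # summary statistics for numeric variables

# A simple roll-up in the spirit of OLAP: aggregate a measure over a dimension.
print(df.groupby("region")["revenue"].sum())
```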
Analytics Model Building
This is the entire process of implementing the solution.
The majority of the project time is spent in the solution
implementation step. The analytical approach of building a
model is a very iterative process because there is no final or
perfect solution.
Validate model:
Like model building the process of validating a model is also
iterative
Evaluation / monitoring:
An ongoing process essentially aimed at looking at the effectiveness
of the solutions over time.
Since the analytical problem-solving approach differs from other
approaches, points to remember are:
•There is a clear reliance on data to drive solution
identification.
•We use analytical techniques based on numerical theories.
•You need a good understanding of how theoretical
concepts apply to business situations in order to build a feasible
solution.
ANALYTICAL MODEL REQUIREMENTS
A good analytical model should satisfy several requirements, depending on the
application area.
Business relevance.
The model should actually solve the business problem for which it was developed.
It is of key importance that the business problem to be solved is appropriately
defined, qualified, and agreed upon by all parties involved at the outset of the
analysis.
Statistical Performance
The model should have statistical significance and predictive
power.
Measurement of this depends on type of analytics selected.
We have various measures to quantify it.
Interpretable and Justifiable.
Interpretability
refers to understanding the patterns that the analytical model
captures.
Ex:- in credit risk modeling or medical diagnosis, interpretable
models are absolutely needed to get good insight into the
underlying data patterns.
Justifiability refers to the degree to which a model corresponds to
prior business knowledge and intuition (Intuition is the ability to
acquire knowledge without proof, evidence, or conscious
reasoning, or without understanding how the knowledge was
acquired.)
For example, a model stating that a higher debt ratio results in
more creditworthy clients may be interpretable, but is not
justifiable because it contradicts basic financial intuition. Note that
both interpretability and justifiability often need to be balanced
against statistical performance.
Operationally Efficient.
Analytical models should also be operationally efficient.
This covers the efforts needed to collect the data, preprocess it,
evaluate the model, and feed its outputs to the
business application.
Operational efficiency also entails the efforts needed
to monitor and back test the model, and re-estimate it
when necessary.
Economic cost
This includes the costs to gather and preprocess the data, the
costs to analyze the data, and the costs to put the resulting
analytical models into production.
In addition, the software costs and human and computing
resources should be taken into account here, e.g., through a
cost-benefit analysis.
Regulation and Legislation .
Given the importance of analytics nowadays, more and more
regulation is being introduced relating to the development and use
of the analytical models
In the context of privacy, many new regulatory developments are taking
place at various levels, e.g., rules on the use of cookies in a web
analytics context, the Basel Accords for credit risk models, and
Solvency II for the insurance sector.
Data sources can be structured, low-level, and detailed; data can also be
obtained from external providers such as Dun & Bradstreet, Thomson Reuters, and Verisk.
DATA SAMPLING AND PREPROCESSING
In real life data can be dirty because of inconsistencies,
incompleteness, duplication, and merging problems.
Sampling is a statistical analysis technique used to select, manipulate, and
analyze a representative subset of data points to identify
patterns and trends in the larger data set being examined.
It helps data scientists, predictive modelers, and
other data analysts work with a small, manageable
amount of data.
Identifying and analyzing a representative sample is
more efficient and cost-effective than surveying the
entirety of the data or population.
Challenges of data sampling
• the size of the required data sample and the
possibility of introducing a sampling error.
• [A sampling error is a statistical error that occurs
when an analyst does not select a sample that
represents the entire population of data and the
results found in the sample do not represent the
results that would be obtained from the entire
population.]
Types of data sampling methods
• There are many different methods for
drawing samples from data; the ideal one
depends on the data set and situation.
• Sampling can be based on probability or
non-probability.
• Probability sampling is a sampling technique in
which samples from a larger population are
chosen using the theory of probability. For a
participant to be considered a probability
sample, he/she must be selected using random
selection.
• The most important requirement of probability
sampling is that everyone in your population has
a known and an equal chance of getting selected.
• For example, if you have a population of 100
people every person would have odds of 1 in 100
for getting selected. Probability sampling gives
you the best chance to create a sample that is
truly representative of the population.
Types of Probability Sampling
Simple random sampling as the name suggests is a
completely random method of selecting the sample. This
sampling method is as easy as assigning numbers to the
individuals (sample) and then randomly choosing from those
numbers through an automated process. Finally, the numbers
that are chosen are the members that are included in the
sample.
Samples are chosen in this method of sampling in two ways: a lottery system,
or number-generating software / a random number table.
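A minimal sketch of simple random sampling using Python's standard library (the population and sample size are illustrative):

```python
import random

# A population of 100 numbered individuals (IDs 1..100).
population = list(range(1, 101))

# Simple random sampling: every individual has an equal chance of selection.
random.seed(42)                      # for a reproducible illustration
sample = random.sample(population, k=10)
print(sample)
```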
Stratified random sampling involves a
method where a larger population is divided into smaller
groups that usually don't overlap but together represent the entire
population. While sampling, these groups can be
organized and a sample then drawn from each group separately.
A common method is to arrange or classify by Gender, age,
ethnicity and similar ways.
Splitting subjects into mutually exclusive groups and then using
simple random sampling to choose members from groups.
Members in each of these groups should be distinct so that
every member of all groups get equal opportunity to be selected
using simple probability. This sampling method is also called
“random quota sampling”
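A minimal sketch of stratified random sampling with pandas (the data and strata are hypothetical; `DataFrame.groupby(...).sample(...)` needs pandas 1.1 or later):

```python
import pandas as pd

# Hypothetical population with a Gender stratum.
people = pd.DataFrame(
    {
        "person_id": range(1, 11),
        "gender": ["Male", "Female", "Male", "Female", "Male",
                   "Female", "Male", "Female", "Male", "Female"],
    }
)

# Split into mutually exclusive groups (strata), then draw a simple
# random sample from each group separately.
stratified_sample = people.groupby("gender").sample(n=2, random_state=42)
print(stratified_sample)
```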
Cluster random sampling is a way to randomly select
participants when they are geographically spread out. For
example, if you wanted to choose 100 participants from the entire
population of the U.S., it is likely impossible to get a complete list
of everyone. Instead, the researcher randomly selects areas (i.e.
cities or counties) and randomly selects from within those
boundaries.
Cluster sampling usually analyzes a particular population in
which the sample consists of more than a few elements, for
example, city, family, university etc. The clusters are then
selected by dividing the greater population into various smaller
sections.
Systematic Sampling is when you choose every “nth” individual
to be a part of the sample. For example, you can choose every 3rd
person to be in the sample. Systematic sampling is an extended
implementation of the same old probability technique in which
each member of the group is selected at regular periods to form a
sample. There’s an equal opportunity for every member of a
population to be selected using this sampling technique.
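A minimal sketch of systematic sampling (an illustrative population of 100 IDs with interval n = 10):

```python
import random

# Systematic sampling: choose every nth individual after a random start.
population = list(range(1, 101))   # IDs 1..100
n = 10                             # sampling interval ("every 10th person")

random.seed(0)
start = random.randrange(n)        # random starting point within the first interval
systematic_sample = population[start::n]
print(systematic_sample)
```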
non-probability sampling
Non-probability sampling is defined as a sampling technique in
which the researcher selects samples based on the subjective
judgment of the researcher rather than random selection. It is a
less stringent method.
This sampling method depends heavily on the expertise of the
researchers. It is carried out by observation, and researchers use
it widely in qualitative research.
Non-probability sampling is a sampling method in which not all
members of the population have an equal chance of participating
in the study, unlike probability sampling, where each member of
the population has a known chance of being selected.
Non-probability sampling is most useful for exploratory studies
like a pilot survey (deploying a survey to a smaller sample
compared to pre-determined sample size).
Researchers use this method in studies where it is not possible to
draw random probability sampling due to time or cost
considerations.
Convenience sampling:
Convenience sampling is a non-probability sampling
technique where samples are selected from the population
only because they are conveniently available to the
researcher.
Researchers choose these samples just because they are easy
to recruit, and the researcher did not consider selecting a sample
that represents the entire population.
Ideally, in research, it is good to test a sample that represents the
population. But, in some research, the population is too large to
examine and consider the entire population. It is one of the
reasons why researchers rely on convenience sampling, which is
the most common non-probability sampling method, because of
its speed, cost-effectiveness, and ease of availability of the
sample.
Consecutive sampling:
This non-probability sampling method is very similar to
convenience sampling, with a slight variation.
Here, the researcher picks a single person or a group of a sample,
conducts research over a period, analyzes the results, and then
moves on to another subject or group if needed. Consecutive
sampling technique gives the researcher a chance to work with
many topics and fine-tune his/her research by collecting results
that have vital insights.
Quota sampling:
Hypothetically consider, a researcher wants to study the career
goals of male and female employees in an organization. There are
500 employees in the organization, also known as the population.
To understand better about a population, the researcher will need
only a sample, not the entire population. Further, the researcher is
interested in particular strata within the population. Here is where
quota sampling helps in dividing the population into strata or
groups.
Judgmental or Purposive sampling:
In the judgmental sampling method, researchers select the
samples based purely on the researcher’s knowledge and
credibility. In other words, researchers choose only those
people who they deem fit to participate in the research study.
Judgmental or purposive sampling is not a scientific method of
sampling, and the downside to this sampling technique is that the
preconceived notions of a researcher can influence the results.
Thus, this research technique involves a high amount of
ambiguity.
TYPES OF STATISTICAL DATA ELEMENTS
Data types are an important concept in statistics, which needs to be understood in order to
correctly apply statistical measurements to your data and therefore to correctly draw
conclusions about it. This section introduces the different data types you need to know
to do proper exploratory data analysis (EDA), which is one of the most underestimated
parts of a machine learning project.
Categorical data represent characteristics such as a person’s
gender, marital status, hometown, or the types of movies they like.
Categorical data can take on numerical values (such as “1”
indicating male and “2” indicating female), but those numbers
don’t have mathematical meaning. You couldn’t add them
together, for example. (Other names for categorical data
are qualitative data, or Yes/No data.)
Numerical data. These data have meaning as a measurement,
such as a person’s height, weight, IQ, or blood pressure; or
they’re a count, such as the number of stock shares a person
owns, how many teeth a dog has, or how many pages you can
read of your favorite book before you fall asleep. (Statisticians
also call numerical data quantitative data.)
Numerical data can be further broken into two types: discrete
and continuous.
Discrete data represent items that can be counted; they take on
possible values that can be listed out. The list of possible values
may be fixed (also called finite); or it may go from 0, 1, 2, on to
infinity (making it countably infinite).
For example, the number of heads in 100 coin flips takes on
values from 0 through 100 (finite case), but the number of flips
needed to get 100 heads takes on values from 100 (the fastest
scenario) on up to infinity (if you never get to that 100th heads). Its
possible values are listed as 100, 101, 102, 103, . . . (representing
the countably infinite case).
Continuous data represent measurements; their possible values
cannot be counted and can only be described using intervals on
the real number line.
For example, the exact amount of gas purchased at the pump for
cars with 20-gallon tanks would be continuous data from 0 gallons
to 20 gallons, represented by the interval [0, 20], inclusive. You
might pump 8.40 gallons, or 8.41, or 8.414863 gallons, or any
possible number from 0 to 20.
In this way, continuous data can be thought of as being uncountably infinite.
For ease of recordkeeping, statisticians usually pick some point in the number
to round off. Another example would be that the lifetime of a C battery can be
anywhere from 0 hours to an infinite number of hours (if it lasts forever),
technically, with all possible values in between. Granted, you don't expect a
battery to last more than a few hundred hours, but no one can put a cap on how
long it can go (remember the Energizer Bunny?).
Fundamental Levels of Measurement Scales
Nominal, Ordinal, Interval and Ratio are defined as the four
fundamental levels of measurement scales that are used to
capture data in the form of surveys and questionnaires
Nominal Scale: 1st Level of Measurement
Nominal scale is a naming scale, where variables are simply
“named” or labeled, with no specific order.
Also called the categorical variable scale, it is defined as a scale used
for labeling variables into distinct classifications that doesn't
involve a quantitative value or order. This scale is the simplest of
the four variable measurement scales.
Where do you live?
1. Suburbs
2. City
3. Town
Nominal scale is often used in research surveys and questionnaires where
only variable labels hold significance.
Which brand of smartphones do you prefer?” Options : “Apple”- 1 , “Samsung”-2,
“OnePlus”-3.
Ordinal Scale: 2nd Level of Measurement
Ordinal scale has all its variables in a specific order, beyond just
naming them.
It is a variable measurement scale used simply to depict the order of
variables and not the difference between each of the variables.
These scales are generally used to depict non-mathematical
ideas such as frequency, satisfaction, happiness, a degree of
pain etc.
How satisfied are you with our services?
Very Unsatisfied – 1
Unsatisfied – 2
Neutral – 3
Satisfied – 4
Very Satisfied – 5
Interval Scale: 3rd Level of Measurement
The interval scale is defined as a numerical scale where the order of the
variables is known as well as the difference between these variables. Variables
which have familiar, constant, and computable differences are
classified using the interval scale.
What is your family income?
What is the temperature in your city?
Ratio Scale: 4th Level of Measurement
The ratio scale is defined as a variable measurement scale that not only produces
the order of variables but also makes the difference between variables
known, along with information on the value of true zero.
With the option of true zero, varied inferential and descriptive
analysis techniques can be applied to the variables.
What is your weight in kilograms?
Less than 50 kilograms
51-70 kilograms
71-90 kilograms
91-110 kilograms
More than 110 kilograms
VISUAL DATA EXPLORATION AND EXPLORATORY STATISTICAL ANALYSIS
Visual data exploration is a very important part of getting to know our data in an
“informal” way.
It allows the analyst to get some initial insights into the data, which can then be
usefully adopted throughout the modeling stage.
Different plots/graphs can be useful here.
Chart Types: Pie Chart
A pie chart is a circular graph divided into slices. The larger a slice is, the
bigger the portion of the total quantity it represents.
A pie chart represents a variable's distribution as a pie, whereby each
section represents the portion of the total percent taken by each value of
the variable.
So, pie charts are best suited to depict sections of a whole. Example:
If a company operates three separate divisions, at year-end its top management
would be interested in seeing what portion of total revenue each division
accounted for.
Chart Types: Bar Charts
Bar charts represent the frequency of each of the values (either
absolute or relative) as bars.
A bar chart is composed of a series of bars illustrating a variable's
development. Given that bar charts are such a common chart type,
people are generally familiar with them and can understand them easily.
A bar chart with one variable is easy to follow.
Bar charts are great when we want to track the development of one or
two variables over time. For example, one of the most frequent
applications of bar charts in corporate presentations is to show how a
company's total revenues have developed during a given period.
Bar charts can also work well for a comparison of two variables over
time. Let's say we would like to compare the revenues of two companies
in the timeframe between 2014 and 2018.
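A minimal sketch of such a grouped bar chart with matplotlib; the revenue figures for the two companies are hypothetical:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical revenue figures (in millions) for two companies, 2014-2018.
years = np.arange(2014, 2019)
company_a = [120, 135, 150, 170, 190]
company_b = [100, 115, 140, 160, 175]

width = 0.4
plt.bar(years - width / 2, company_a, width, label="Company A")
plt.bar(years + width / 2, company_b, width, label="Company B")
plt.xlabel("Year")
plt.ylabel("Revenue")
plt.title("Revenue comparison, 2014-2018")
plt.legend()
plt.show()
```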
Chart Types: Histogram charts
A histogram is a plot that lets you
discover, and show, the underlying
frequency distribution (shape) of a set
of continuous data. This allows the
inspection of the data for its underlying
distribution (e.g., normal distribution),
outliers, skewness, etc. An example of a
histogram, and the raw data it was
constructed from, is shown below:
36 25 38 46 55 68 72 55 36 38
67 45 22 48 91 46 52 61 58 55
To construct a histogram from a continuous variable you
first need to split the data into intervals, called bins.
In the example above, age has been split into bins, with
each bin representing a 10-year period starting at 20
years.
Each bin contains the number of occurrences of scores in
the data set that are contained within that bin.
For the above data set, the frequencies in each bin have
been tabulated along with the scores that contributed to
the frequency in each bin (see below):
Bin      Frequency   Scores Included in Bin
20-30    2           25, 22
30-40    4           36, 38, 36, 38
40-50    4           46, 45, 48, 46
50-60    5           55, 55, 52, 58, 55
60-70    3           68, 67, 61
70-80    1           72
80-90    0           -
90-100   1           91
Notice that, unlike a bar chart, there are no "gaps" between the bars
(although some bars might be "absent" reflecting no frequencies).
This is because a histogram represents a continuous data set, and as
such, there are no gaps in the data (although you will have to decide
whether you round up or round down scores on the boundaries of
bins).
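A minimal sketch that reproduces this histogram with matplotlib, using the raw ages listed above and 10-year bins starting at 20:

```python
import matplotlib.pyplot as plt

# The raw scores (ages) from the slide.
ages = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
        67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

# Bins of width 10, starting at 20, matching the tabulated frequencies.
bins = range(20, 101, 10)
plt.hist(ages, bins=bins, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Histogram of ages")
plt.show()
```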
Chart Types: Scatter plots
A scatter plot is a type of chart that is often used in the fields of
statistics and data science. It consists of multiple data points plotted
across two axes. Each variable depicted in a scatter plot would have
multiple observations. If a scatter plot includes more than two
variables, then we would use different colours to signify that.
A scatter plot chart is a great indicator that allows us to see whether
there is a pattern to be found between two variables.
The x-axis contains information about house price, while the y-axis
indicates house size. There is an obvious pattern to be found – a
positive relationship between the two. The bigger a house is, the
higher its price.
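A minimal sketch of such a scatter plot with matplotlib; the price and size values are hypothetical and only meant to show the positive relationship described above:

```python
import matplotlib.pyplot as plt

# Hypothetical observations: house price on the x-axis, house size on the y-axis,
# mirroring the example described above.
price = [150, 200, 240, 300, 360, 420, 500]   # in thousands
size = [55, 70, 85, 100, 120, 140, 165]        # in square metres

plt.scatter(price, size)
plt.xlabel("House price (thousands)")
plt.ylabel("House size (sq. m)")
plt.title("House size vs. price")
plt.show()
```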
Chart Types: Box plot
Box plots (also known as box and whisker plots) are a type of chart
often used in exploratory data analysis to visually show the distribution
of numerical data and skewness by displaying the data quartiles (or
percentiles) and averages.
Minimum Score
The lowest score, excluding outliers (shown at the end of the left
whisker).
Lower Quartile
Twenty-five percent of scores fall below the lower quartile value (also
known as the first quartile).
Median
The median marks the mid-point of the data and is shown by the line
that divides the box into two parts (sometimes known as the second
quartile). Half the scores are greater than or equal to this value and
half are less.
Upper Quartile
Seventy-five percent of the scores fall below the upper quartile value
(also known as the third quartile). Thus, 25% of data are above this
value.
Maximum Score
The highest score, excluding outliers (shown at the end of the right
whisker).
Whiskers
The upper and lower whiskers represent scores outside the middle
50% (i.e. the lower 25% of scores and the upper 25% of scores).
The Interquartile Range (or IQR)
This is the box plot showing the middle 50% of scores (i.e., the range
between the 25th and 75th percentile).
Box plots divide the data into sections that each contain
approximately 25% of the data in that set.
Box plots are useful as they show the skewness (the degree of
asymmetry observed in a probability distribution) of a data set.
Box plots are also useful as they show the average score of a data set.
The median is the middle value of a set of
data and is shown by the line that divides the box
into two parts. Half the scores are greater than or
equal to this value and half are less.
When the median is in the middle of
the box, and the whiskers are about
the same on both sides of the box,
then the distribution is symmetric.
When the median is closer to the
bottom of the box, and if the whisker
is shorter on the lower end of the
box, then the distribution is positively
skewed (skewed right).
When the median is closer to the top
of the box, and if the whisker is
shorter on the upper end of the box,
then the distribution is negatively
skewed (skewed left).
Box plots are useful as they show outliers within a
data set.
An outlier is an observation that is numerically distant from the rest of
the data.
When reviewing a box plot, an outlier is defined as a data point that is
located outside the whiskers of the box plot.
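A minimal sketch that draws a box plot and applies the common 1.5 × IQR rule to flag outliers; it reuses the age scores from the histogram example, and `statistics.quantiles` requires Python 3.8+:

```python
import statistics
import matplotlib.pyplot as plt

# Reusing the age scores from the histogram example above.
scores = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
          67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

# Box = interquartile range, centre line = median, whiskers = non-outlier range.
plt.boxplot(scores, vert=False)
plt.xlabel("Score")
plt.show()

# The common 1.5 * IQR rule used to flag points beyond the whiskers as outliers.
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
outliers = [s for s in scores if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]
print("IQR:", iqr, "Outliers:", outliers)
```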
MISSING VALUE TREATMENT
Missing data in the training data set can reduce the power / fit of a model or can
lead to a biased model because we have not analysed the behavior and
relationship with other variables correctly. It can lead to wrong prediction or
classification.
There are various reasons for the occurrence of these missing values. They
may occur at two stages:
1.Data Extraction: It is possible that there are problems with extraction
process. In such cases, we should double-check for correct data with
data guardians. Some hashing procedures can also be used to make
sure data extraction is correct. Errors at data extraction stage are
typically easy to find and can be corrected easily as well.
2.Data collection: These errors occur at time of data collection and are
harder to correct. They can be categorized in four types:
1.Missing completely at random:
This is a case when the probability of missing variable is same
for all observations.Data is missing independently of both observed and
unobserved data.
Example: A survey respondent randomly skips a question.
For example: respondents in a data collection process decide that they
will declare their earnings only after tossing a fair coin. If a head occurs,
the respondent declares his/her earnings, and otherwise not. Here
each observation has an equal chance of having a missing value.
2.Missing at random: This is a case when variable is missing at random
and missing ratio varies for different values / level of other input
variables .The probability of missing data is related to the observed
data but not the missing data.
Example: People with higher income are less likely to report their
income, but we have other related variables like job title and education
level.
3.Missing that depends on unobserved predictors: This is a case when
the missing values are not random and are related to the unobserved
input variable. The missingness is related to the unobserved data itself.
For example: In a medical study, if a particular diagnostic causes
discomfort, then there is higher chance of drop out from the study.
This missing value is not at random unless we have included
“discomfort” as an input variable for all patients.
Example: People with very low incomes might be less likely to report
their income because of stigma.
4.Missing that depends on the missing value itself: This is a case
when the probability of missing value is directly correlated with
missing value itself. For example: People with higher or lower income
are likely to provide non-response to their earning.
Which are the methods to treat missing values ?
Some analytical techniques (e.g., decision trees) can directly deal with
missing values. Other techniques need some additional preprocessing.
The following are the most popular schemes to deal with missing
values:
1. Deletion . This is the most straightforward option and consists of
deleting observations or variables with lots of missing values. This, of
course, assumes that information is missing at random and has no
meaningful interpretation and/or relationship to the target.
Deletion is of two types: list-wise deletion and pair-wise deletion.
In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is
one of the major advantages of this method, but it reduces the power of the model
because it reduces the sample size.
In pair-wise deletion, we perform each analysis with all cases in which the variables of interest
are present. The advantage of this method is that it keeps as many cases as possible available
for analysis. One disadvantage is that it uses different sample sizes for different
variables.
Deletion methods are used when the nature of missing data is “Missing
completely at random” else non random missing values can bias the model
output.
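A minimal sketch of list-wise and pair-wise deletion with pandas on a small hypothetical data set:

```python
import pandas as pd
import numpy as np

# Small hypothetical data set with missing values.
df = pd.DataFrame(
    {
        "age": [25, 30, np.nan, 40],
        "income": [50000, np.nan, 42000, 61000],
        "gender": ["Male", "Female", "Female", np.nan],
    }
)

# List-wise deletion: drop every observation with any missing variable.
listwise = df.dropna()

# Pair-wise deletion: each analysis uses all cases where its own variables are
# present, e.g. a correlation computed only on rows where both columns are observed.
pairwise_corr = df[["age", "income"]].dropna().corr()

print(listwise)
print(pairwise_corr)
```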
2. Replace (imputation ). This implies replacing the missing value
with a known value.
Imputation is a method to fill in the missing values with estimated
ones. The objective is to employ known relationships that can be
identified in the valid values of the data set to assist in estimating the
missing values. Mean / Mode / Median imputation is one of the most
frequently used methods. It consists of replacing the missing data for
a given attribute by the mean or median (quantitative attribute) or
mode (qualitative attribute) of all known values of that variable.
Mean/Median/Mode Imputation: Replace missing values
with the mean, median, or mode of the observed values.
Simple but can distort data distribution.
K-Nearest Neighbors (KNN) Imputation: Replace missing
values using the nearest neighbors' values. Effective but
computationally expensive.
Multiple Imputation: Create multiple imputations for the
missing values and average the results. This approach
preserves the uncertainty of the missing data.
Regression Imputation: Use regression models to predict and
fill in missing values based on other variables.
Hot Deck Imputation: Replace missing values with values
from similar records in the dataset.
Machine Learning Models: Advanced models like Random
Forests or Neural Networks can be used to predict missing
values.
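A minimal sketch of mean/median/mode imputation with pandas on hypothetical data; more advanced options such as KNN or regression imputation would typically use a separate library such as scikit-learn:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "age": [25, 30, np.nan, 40, 35],
        "income": [50000, np.nan, 42000, 61000, np.nan],
        "city": ["Pune", "Mumbai", np.nan, "Pune", "Mumbai"],
    }
)

# Mean imputation for a quantitative attribute.
df["age"] = df["age"].fillna(df["age"].mean())

# Median imputation is more robust to outliers.
df["income"] = df["income"].fillna(df["income"].median())

# Mode imputation for a qualitative attribute.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```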
Two Types of Imputation:
Generalized Imputation: In this case, we calculate the mean or median of all
non-missing values of that variable and then replace the missing value with that
mean or median. For example, in the above table, the variable "Manpower" has a
missing value, so we take the average of all non-missing values of "Manpower"
(28.33) and then replace the missing value with it.
Similar Case Imputation: In this case, we calculate the average for gender
"Male" (29.75) and "Female" (25) individually from the non-missing values and
then replace the missing value based on gender. For "Male" we replace missing
values of Manpower with 29.75, and for "Female" with 25.
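A minimal sketch contrasting generalized and similar-case imputation with pandas; the small data set is a hypothetical stand-in for the table mentioned above, which is not reproduced in these notes:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in data with a gender column and missing Manpower values.
df = pd.DataFrame(
    {
        "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
        "manpower": [30, 29.5, 25, np.nan, np.nan, 25],
    }
)

# Generalized imputation: one overall mean for the whole column.
overall_fill = df["manpower"].fillna(df["manpower"].mean())

# Similar case imputation: a separate mean per gender group.
groupwise_fill = df.groupby("gender")["manpower"].transform(
    lambda s: s.fillna(s.mean())
)

print(overall_fill)
print(groupwise_fill)
```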
3. Keep. Missing values can be meaningful (e.g., a customer did not disclose his or her
income because he or she is currently unemployed). If this is clearly related to
the target (e.g., good/bad risk), it needs to be considered as a separate category.
More Related Content

Similar to Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA

Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.
Aditya205306
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Akshata Humbe
 
Big Data Presentation
Big Data PresentationBig Data Presentation
Big Data Presentation
AbhijeetPandey71
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
Vipul Kalamkar
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
Johnson Ubah
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
Sukirti Garg
 
Big data upload
Big data uploadBig data upload
Big data upload
Bhavin Tandel
 
Unit III.pdf
Unit III.pdfUnit III.pdf
Unit III.pdf
PreethaSuresh2
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdf
rajsharma159890
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
saranya270513
 
elgendy2014.pdf
elgendy2014.pdfelgendy2014.pdf
elgendy2014.pdf
Akuhuruf
 
Big data
Big dataBig data
Big data
Abhishek Palo
 
Big data
Big dataBig data
Big data
Abhishek Palo
 
Unit No2 Introduction to big data.pdf
Unit No2 Introduction to big data.pdfUnit No2 Introduction to big data.pdf
Unit No2 Introduction to big data.pdf
Ranjeet Bhalshankar
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
Dr. Radhey Shyam
 
Bigdata
BigdataBigdata
Sample
Sample Sample
Big Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptxBig Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptx
PrabhaJoshi4
 
An Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAn Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data Analytics
Audrey Britton
 
big data.pptx
big data.pptxbig data.pptx
big data.pptx
VirajSaud
 

Similar to Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA (20)

Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big Data Presentation
Big Data PresentationBig Data Presentation
Big Data Presentation
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
Big data upload
Big data uploadBig data upload
Big data upload
 
Unit III.pdf
Unit III.pdfUnit III.pdf
Unit III.pdf
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdf
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
elgendy2014.pdf
elgendy2014.pdfelgendy2014.pdf
elgendy2014.pdf
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Unit No2 Introduction to big data.pdf
Unit No2 Introduction to big data.pdfUnit No2 Introduction to big data.pdf
Unit No2 Introduction to big data.pdf
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
Bigdata
BigdataBigdata
Bigdata
 
Sample
Sample Sample
Sample
 
Big Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptxBig Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptx
 
An Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAn Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data Analytics
 
big data.pptx
big data.pptxbig data.pptx
big data.pptx
 

Recently uploaded

Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
harendmgr
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
chetankumar9855
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
LINAT
 
Welcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what yourWelcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what your
Virni Arrora
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
sharonblush
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
kihus38
 
the unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithmthe unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithm
huseindihon
 
Universidad de Barcelona degree offer diploma Transcript
Universidad de Barcelona  degree offer diploma TranscriptUniversidad de Barcelona  degree offer diploma Transcript
Universidad de Barcelona degree offer diploma Transcript
taqyea
 
International Journal of Fuzzy Logic Systems (IJFLS)
International Journal of Fuzzy Logic Systems (IJFLS)International Journal of Fuzzy Logic Systems (IJFLS)
International Journal of Fuzzy Logic Systems (IJFLS)
ijfls
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
GaneshGanesh399816
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
kamli sharma#S10
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
javier ramirez
 
Universidad de Alcalá degree offer diploma Transcript
Universidad de Alcalá  degree offer diploma TranscriptUniversidad de Alcalá  degree offer diploma Transcript
Universidad de Alcalá degree offer diploma Transcript
taqyea
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
sapna sharmap11
 
University of Wollongong degree offer diploma Transcript
University of Wollongong  degree offer diploma TranscriptUniversity of Wollongong  degree offer diploma Transcript
University of Wollongong degree offer diploma Transcript
taqyea
 
Contemporary Islamic Finance Practices_2022.pdf
Contemporary Islamic Finance Practices_2022.pdfContemporary Islamic Finance Practices_2022.pdf
Contemporary Islamic Finance Practices_2022.pdf
DngQuct12A1
 
the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...
huseindihon
 
Ahrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptxAhrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptx
tylermmo95
 
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
dizzycaye
 
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).docbai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
PhngThLmHnh
 

Recently uploaded (20)

Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
 
Welcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what yourWelcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what your
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
 
the unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithmthe unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithm
 
Universidad de Barcelona degree offer diploma Transcript
Universidad de Barcelona  degree offer diploma TranscriptUniversidad de Barcelona  degree offer diploma Transcript
Universidad de Barcelona degree offer diploma Transcript
 
International Journal of Fuzzy Logic Systems (IJFLS)
International Journal of Fuzzy Logic Systems (IJFLS)International Journal of Fuzzy Logic Systems (IJFLS)
International Journal of Fuzzy Logic Systems (IJFLS)
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
 
Universidad de Alcalá degree offer diploma Transcript
Universidad de Alcalá  degree offer diploma TranscriptUniversidad de Alcalá  degree offer diploma Transcript
Universidad de Alcalá degree offer diploma Transcript
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
 
University of Wollongong degree offer diploma Transcript
University of Wollongong  degree offer diploma TranscriptUniversity of Wollongong  degree offer diploma Transcript
University of Wollongong degree offer diploma Transcript
 
Contemporary Islamic Finance Practices_2022.pdf
Contemporary Islamic Finance Practices_2022.pdfContemporary Islamic Finance Practices_2022.pdf
Contemporary Islamic Finance Practices_2022.pdf
 
the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...
 
Ahrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptxAhrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptx
 
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
 
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).docbai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
 

Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA

  • 10. Examples Of Un-structured Data: the output returned by a 'Google Search'.
  • 11. Characteristics Of 'Big Data' Volume: The name 'Big Data' itself relates to a size that is enormous. The size of the data plays a crucial role in determining the value that can be derived from it, and whether a particular data set can be considered Big Data at all depends on its volume. Hence, 'Volume' is one characteristic that needs to be considered when dealing with Big Data. • About 500 hours of video are uploaded to YouTube every minute.
  • 13. Velocity The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential of the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous. Facebook users, for example, upload more than 900 million photos a day. Facebook's data warehouse stores upwards of 300 petabytes of data, but the velocity at which new data is created should also be taken into account: Facebook claims 600 terabytes of incoming data per day. Google alone processes on average more than 40,000 search queries every second, which roughly translates to more than 3.5 billion searches per day.
  • 14. Variety Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analysing data.
  • 16. Veracity Big Data veracity refers to the biases, noise and abnormality in data: is the data that is being stored and mined meaningful to the problem being analyzed? Veracity is the biggest challenge in data analysis when compared to things like volume and velocity, owing to uncertainty caused by inconsistency, incompleteness, ambiguity, etc.
  • 18. Validity: Validity refers to how accurate and correct the data is for its intended use. According to Forbes, an estimated 60 percent of a data scientist's time is spent cleansing data before being able to do any analysis. Volatility: How old does your data need to be before it is considered irrelevant, historic, or no longer useful? How long does data need to be kept for? Vulnerability: A vulnerability refers to any type of weakness in a computer system itself, in a set of procedures, or in anything that leaves information security exposed to a threat.
  • 19. Visualization You can't rely on traditional graphs when trying to plot a billion data points, so you need different ways of representing data such as data clustering or using tree maps, sunbursts, parallel coordinates, circular network diagrams, or cone trees. Value Last, but arguably the most important of all, is value. The other characteristics of big data are meaningless if you don't derive business value from the data.
  • 20. Quiz Time : • 1. Data in bytes size is called big data.
  • 22. Data Analytics Data analysis, also known as analysis of data or data analytics, is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, risk, health care, web, science, and social science domains. Analytics is not a tool or technology; rather, it is a way of thinking and acting.
  • 23. Types Of Analyses Every Data Scientist Should Know Descriptive Predictive Prescriptive Diagnostic Exploratory Inferential
  • 24. Types Of Analyses Every Data Scientist Should Know Descriptive (least amount of effort): Descriptive analytics uses existing information from the past to understand decisions in the present and helps decide an effective course of action in the future. It is the discipline of quantitatively describing the main features of a collection of data; in essence, it describes a set of data. – Typically the first kind of data analysis performed on a data set – Commonly applied to large volumes of data, such as census data – The description and interpretation processes are different steps – Univariate and bivariate are two types of statistical descriptive analyses – Type of data set applied to: a census data set, i.e. a whole population
  • 25. Descriptive Analytics: What is happening? Insight into the past. Descriptive analysis or statistics does exactly what the name implies: it "describes", or summarizes, raw data and makes it something that is interpretable by humans. These are analytics that describe the past. They are useful because they allow us to learn from past behaviors and understand how they might influence future outcomes. Common examples of descriptive analytics are reports that provide historical insights regarding the company's production, financials, operations, sales, inventory and customers.
  • 26. Predictive Analytics: What is likely to happen? Understanding the future “Predict” what might happen. These analytics are about understanding the future. Predictive analytics provide estimates about the likelihood of a future outcome. The foundation of predictive analytics is based on probabilities. Few Examples are : •understanding how sales might close at the end of the year, •predicting what items customers will purchase together, or •forecasting inventory levels based upon a myriad of variables.
  • 27. Prescriptive Analytics: What do I need to do? Advise on possible outcomes. Prescriptive analytics allows users to "prescribe" a number of different possible actions and guides them towards a solution. Prescriptive analytics predicts not only what will happen, but also why it will happen, providing recommendations regarding actions that will take advantage of the predictions. Prescriptive analytics uses a combination of techniques and tools such as business rules, algorithms, machine learning and computational modelling procedures. Example: optimize production, scheduling and inventory in the supply chain to make sure that the right products are delivered at the right time, optimizing the customer experience.
  • 28. Diagnostic Analytics is a form of advanced analytics which examines data or content to answer the question “Why did it happen?”, and is characterized by techniques such as drill-down, data discovery, data mining and correlations. Ex: healthcare provider compares patients’ response to a promotional campaign in different regions;
  • 29. Exploratory An approach to analyzing data sets to find previously unknown relationships. –Exploratory models are good for discovering new connections –They are also useful for defining future studies/questions –Exploratory analyses are usually not the definitive answer to the question at hand, but only the start –Exploratory analyses alone should not be used for generalizing and/or predicting Ex: Microarray studies : aimed to find uncharacterised genes, which act at specific points during the cell cycle
  • 30. Inferential Aims to test theories about the nature of the world in general (or some part of it) based on samples of “subjects” taken from the world (or some part of it). That is, use a relatively small sample of data to say something about a bigger population. Inference involves estimating both the quantity you care about and your uncertainty about your estimate – Inference depends heavily on both the population and the sampling scheme Type of data set applied to: Observational, Cross Sectional Time Study, and Retrospective Data Set – the right, randomly sampled population
  • 33. EXAMPLE APPLICATIONS The relevance, importance, and impact of analytics are now bigger than ever before and, given that more and more data are being collected and that there is strategic value in knowing what is hidden in data, analytics will continue to grow.
  • 34. ANALYTICS PROCESS MODEL • Problem Definition. • Identification of data source. • Selection of Data. • Data Cleaning. • Transformation of data. • Analytics. • Interpretation and Evaluation.
  • 35. Problem Definition • Problem identification and definition: the problem is a situation that is judged as something that needs to be corrected. • It is the job of the analyst to make sure that the right problem is solved. Problems can be identified through: • Comparative / benchmarking studies. Benchmarking is comparing one's business processes and performance metrics to industry bests and best practices from other companies. • Performance reporting: assessment of present performance against goals and objectives. • SWOT analysis.
  • 37. Depending on the type of the problem, the source data needs to be identified. Data is the key ingredient of any analytical exercise, so the selection of data has a deterministic impact on the analytical models we build. A few data collection techniques: • Using data that has already been collected by others. • Systematically selecting and watching characteristics of people, objects and events. • Oral questioning of respondents, either individually or as a group. • Facilitating free discussions on specific topics with a selected group of participants.
  • 38. Data Storage: All data will then be gathered in a staging area, which could be, for example, a data mart or data warehouse.
  • 39. Data Exploration / Data Cleaning Before a formal data analysis can be conducted, the analyst must know how many cases are in the data set, what variables are included, how many missing observations there are, and what general problems the data is likely to suffer from. Analysts commonly use visualization for data exploration because it allows users to quickly and simply view most of the relevant features of their data set. Basic exploratory analysis can also be applied here, such as online analytical processing (OLAP) facilities for multidimensional data analysis (e.g., roll-up, drill-down, slicing and dicing).
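A rough illustration of this exploration step is sketched below in Python, assuming pandas is available; the DataFrame and column names are hypothetical, not taken from the slides.

```python
import pandas as pd

# Hypothetical customer data; in practice this would be read from the staging area,
# e.g. customers = pd.read_csv("customers.csv")
customers = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "age":    [34, 45, None, 29, 52],
    "income": [52000, 61000, 48000, None, 75000],
})

print(customers.shape)         # how many cases and variables there are
print(customers.isna().sum())  # missing observations per variable
print(customers.describe())    # basic summary statistics

# A simple aggregation by region, similar in spirit to an OLAP roll-up
print(customers.groupby("region")["income"].mean())
```

The final group-by aggregation plays a role comparable to an OLAP roll-up: summarizing detailed records at a coarser level before any formal modeling starts.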
  • 40. Analytics Model Building This is the entire process of implementing the solution. The majority of the project time is spent in the solution implementation step. The analytical approach of building a model is a very iterative process because there is no final or perfect solution. Validate model: like model building, the process of validating a model is also iterative.
  • 41. Evaluation / monitoring: an ongoing process essentially aimed at looking at the effectiveness of the solution over time. Since the analytical problem-solving approach differs from other approaches, points to remember are: • There is a clear reliance on data to drive solution identification. • We use analytical techniques based on numeric theories. • You need a good understanding of how theoretical concepts apply to business situations in order to build a feasible solution.
  • 43. ANALYTICAL MODEL REQUIREMENTS A good analytical model should satisfy several requirements, depending on the application area. Business relevance: the model should actually solve the business problem for which it was developed. It is of key importance that the business problem to be solved is appropriately defined, qualified, and agreed upon by all parties involved at the outset of the analysis.
  • 44. Statistical Performance The model should have statistical significance and predictive power. Measurement of this depends on type of analytics selected. We have various measures to quantify it.
  • 46. Interpretable and Justifiable. Interpretability refers to understanding the patterns that the analytical model captures. Ex:- in credit risk modeling or medical diagnosis, interpretable models are absolutely needed to get good insight into the underlying data patterns.
  • 47. Justifiability refers to the degree to which a model corresponds to prior business knowledge and intuition (intuition is the ability to acquire knowledge without proof, evidence, or conscious reasoning, or without understanding how the knowledge was acquired). For example, a model stating that a higher debt ratio results in more creditworthy clients may be interpretable, but it is not justifiable because it contradicts basic financial intuition. Note that both interpretability and justifiability often need to be balanced against statistical performance.
  • 48. Operationally Efficient: Analytical models should also be operationally efficient. This covers the efforts needed to collect the data, preprocess it, evaluate the model, and feed its outputs to the business application. Operational efficiency also entails the efforts needed to monitor and backtest the model, and to re-estimate it when necessary.
  • 49. Economic cost: This includes the costs to gather and preprocess the data, the costs to analyze the data, and the costs to put the resulting analytical models into production. In addition, the software costs and human and computing resources should be taken into account here (e.g., through a cost–benefit analysis). Regulation and Legislation: Given the importance of analytics nowadays, more and more regulation is being introduced relating to the development and use of analytical models. In the context of privacy, many new regulatory developments are taking place at various levels, e.g. the use of cookies in a web analytics context, the Basel Accords for credit risk models, and Solvency II in the insurance sector.
  • 50. Structured, low-level, detailed data; external data providers include Dun & Bradstreet, Thomson Reuters, and Verisk.
  • 51. DATA SAMPLING AND PREPROCESSING In real life, data can be dirty because of inconsistencies, incompleteness, duplication, and merging problems. Sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined. It helps data scientists, predictive modelers and other data analysts work with a small, manageable amount of data. Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.
  • 52. Challenges of data sampling • the size of the required data sample and the possibility of introducing a sampling error. • [A sampling error is a statistical error that occurs when an analyst does not select a sample that represents the entire population of data and the results found in the sample do not represent the results that would be obtained from the entire population.]
  • 53. Types of data sampling methods • There are many different methods for drawing samples from data; the ideal one depends on the data set and situation. • Sampling can be based on probability or non-probability methods.
  • 54. • Probability sampling is a sampling technique in which samples from a larger population are chosen using the theory of probability. For a participant to be considered a probability sample, he/she must be selected using random selection. • The most important requirement of probability sampling is that everyone in your population has a known and equal chance of getting selected. • For example, if you have a population of 100 people, every person would have odds of 1 in 100 of getting selected. Probability sampling gives you the best chance to create a sample that is truly representative of the population.
  • 55. Types of Probability Sampling Simple random sampling, as the name suggests, is a completely random method of selecting the sample. This sampling method is as easy as assigning numbers to the individuals (sample) and then randomly choosing from those numbers through an automated process. Finally, the numbers that are chosen are the members that are included in the sample. Samples are chosen in this method using a lottery system, number-generating software, or a random number table.
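A minimal sketch of simple random sampling in Python, assuming pandas is available; the population and sample sizes are illustrative.

```python
import pandas as pd

# Hypothetical population of 100 individuals
population = pd.DataFrame({"person_id": range(1, 101)})

# Simple random sample of 10 members; every individual has an equal chance of selection.
# random_state is fixed only to make the example reproducible.
sample = population.sample(n=10, random_state=42)
print(sample["person_id"].tolist())
```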
  • 56. Stratified random sampling involves a method where a larger population can be divided into smaller groups that usually don't overlap but together represent the entire population. When sampling, these groups can be organized and a sample then drawn from each group separately.
  • 57. A common method is to arrange or classify by Gender, age, ethnicity and similar ways. Splitting subjects into mutually exclusive groups and then using simple random sampling to choose members from groups. Members in each of these groups should be distinct so that every member of all groups get equal opportunity to be selected using simple probability. This sampling method is also called “random quota sampling”
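A minimal sketch of stratified random sampling under the same assumptions (pandas ≥ 1.1 for `GroupBy.sample`; the data and group sizes are hypothetical).

```python
import pandas as pd

# Hypothetical population with a gender stratum
population = pd.DataFrame({
    "person_id": range(1, 11),
    "gender": ["Male", "Female"] * 5,
})

# Draw a separate simple random sample from each stratum (here, 2 per gender),
# so that both groups are represented in the final sample.
stratified = population.groupby("gender").sample(n=2, random_state=1)
print(stratified)
```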
  • 58. Cluster random sampling is a way to randomly select participants when they are geographically spread out. For example, if you wanted to choose 100 participants from the entire population of the U.S., it is likely impossible to get a complete list of everyone. Instead, the researcher randomly selects areas (i.e. cities or counties) and randomly selects from within those boundaries. Cluster sampling usually analyzes a particular population in which the sample consists of more than a few elements, for example, city, family, university etc. The clusters are then selected by dividing the greater population into various smaller sections.
  • 59. Systematic Sampling is when you choose every “nth” individual to be a part of the sample. For example, you can choose every 3rd person to be in the sample. Systematic sampling is an extended implementation of the same old probability technique in which each member of the group is selected at regular periods to form a sample. There’s an equal opportunity for every member of a population to be selected using this sampling technique.
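A minimal sketch of systematic sampling, taking every nth record after a random start; the population and interval are illustrative.

```python
import random
import pandas as pd

# Hypothetical ordered population of 30 individuals
population = pd.DataFrame({"person_id": range(1, 31)})

n = 3                        # take every 3rd individual, as in the slide's example
start = random.randrange(n)  # random starting point within the first interval

# Step through the frame at regular intervals of n
systematic_sample = population.iloc[start::n]
print(systematic_sample["person_id"].tolist())
```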
  • 60. Non-probability sampling Non-probability sampling is defined as a sampling technique in which the researcher selects samples based on subjective judgment rather than random selection. It is a less stringent method. This sampling method depends heavily on the expertise of the researchers. It is carried out by observation, and researchers use it widely in qualitative research.
  • 61. Non-probability sampling is a sampling method in which not all members of the population have an equal chance of participating in the study, unlike probability sampling, where each member of the population has a known chance of being selected. Non-probability sampling is most useful for exploratory studies like a pilot survey (deploying a survey to a smaller sample compared to pre-determined sample size). Researchers use this method in studies where it is not possible to draw random probability sampling due to time or cost considerations.
  • 62. Convenience sampling: Convenience sampling is a non-probability sampling technique where samples are selected from the population only because they are conveniently available to the researcher. Researchers choose these samples just because they are easy to recruit, and the researcher did not consider selecting a sample that represents the entire population. Ideally, in research, it is good to test a sample that represents the population. But, in some research, the population is too large to examine and consider the entire population. It is one of the reasons why researchers rely on convenience sampling, which is the most common non-probability sampling method, because of its speed, cost-effectiveness, and ease of availability of the sample.
  • 63. Consecutive sampling: This non-probability sampling method is very similar to convenience sampling, with a slight variation. Here, the researcher picks a single person or a group of a sample, conducts research over a period, analyzes the results, and then moves on to another subject or group if needed. Consecutive sampling technique gives the researcher a chance to work with many topics and fine-tune his/her research by collecting results that have vital insights.
  • 64. Quota sampling: Hypothetically consider, a researcher wants to study the career goals of male and female employees in an organization. There are 500 employees in the organization, also known as the population. To understand better about a population, the researcher will need only a sample, not the entire population. Further, the researcher is interested in particular strata within the population. Here is where quota sampling helps in dividing the population into strata or groups.
  • 65. Judgmental or Purposive sampling: In the judgmental sampling method, researchers select the samples based purely on the researcher’s knowledge and credibility. In other words, researchers choose only those people who they deem fit to participate in the research study. Judgmental or purposive sampling is not a scientific method of sampling, and the downside to this sampling technique is that the preconceived notions of a researcher can influence the results. Thus, this research technique involves a high amount of ambiguity.
  • 66. TYPES OF STATISTICAL DATA ELEMENTS Data types are an important concept of statistics, which needs to be understood to correctly apply statistical measurements to your data and therefore to correctly draw conclusions about it. This section introduces the different data types you need to know in order to do proper exploratory data analysis (EDA), which is one of the most underestimated parts of a machine learning project.
  • 67. Categorical data represent characteristics such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as “1” indicating male and “2” indicating female), but those numbers don’t have mathematical meaning. You couldn’t add them together, for example. (Other names for categorical data are qualitative data, or Yes/No data.)
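A small sketch of why such numeric codes carry no mathematical meaning, assuming pandas; the codes below are illustrative.

```python
import pandas as pd

# Gender coded as 1 = male, 2 = female: the numbers are only labels
codes = pd.Series([1, 2, 2, 1, 1])

# Treated as numbers, operations like the mean are computable but meaningless
print(codes.mean())           # 1.4 -- no real-world interpretation

# Declaring the variable as categorical makes the intent explicit
gender = codes.map({1: "Male", 2: "Female"}).astype("category")
print(gender.value_counts())  # counting categories is the meaningful operation
```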
  • 68. Numerical data. These data have meaning as a measurement, such as a person’s height, weight, IQ, or blood pressure; or they’re a count, such as the number of stock shares a person owns, how many teeth a dog has, or how many pages you can read of your favorite book before you fall asleep. (Statisticians also call numerical data quantitative data.) Numerical data can be further broken into two types: discrete and continuous.
  • 69. Discrete data represent items that can be counted; they take on possible values that can be listed out. The list of possible values may be fixed (also called finite); or it may go from 0, 1, 2, on to infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes on values from 0 through 100 (finite case), but the number of flips needed to get 100 heads takes on values from 100 (the fastest scenario) on up to infinity (if you never get to that 100th heads). Its possible values are listed as 100, 101, 102, 103, . . . (representing the countably infinite case).
  • 70. Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line. For example, the exact amount of gas purchased at the pump for cars with 20-gallon tanks would be continuous data from 0 gallons to 20 gallons, represented by the interval [0, 20], inclusive. You might pump 8.40 gallons, or 8.41, or 8.414863 gallons, or any possible number from 0 to 20; in this way, continuous data can be thought of as being uncountably infinite. For ease of recordkeeping, statisticians usually pick some point at which to round off the number. Another example is that the lifetime of a C battery can be anywhere from 0 hours to an infinite number of hours (if it lasts forever), technically, with all possible values in between. Granted, you don't expect a battery to last more than a few hundred hours, but no one can put a cap on how long it can go (remember the Energizer Bunny).
  • 71. fundamental levels of measurement scales Nominal, Ordinal, Interval and Ratio are defined as the four fundamental levels of measurement scales that are used to capture data in the form of surveys and questionnaires
  • 72. Nominal Scale: 1st Level of Measurement A nominal scale is a naming scale, where variables are simply "named" or labeled, with no specific order. Also called the categorical variable scale, it is defined as a scale used for labeling variables into distinct classifications and doesn't involve a quantitative value or order. This scale is the simplest of the four variable measurement scales. Example: Where do you live? 1 - Suburbs, 2 - City, 3 - Town. Nominal scales are often used in research surveys and questionnaires where only variable labels hold significance. "Which brand of smartphone do you prefer?" Options: "Apple" - 1, "Samsung" - 2, "OnePlus" - 3.
  • 73. Ordinal Scale: 2nd Level of Measurement An ordinal scale has all its variables in a specific order, beyond just naming them. It is a variable measurement scale used simply to depict the order of variables and not the difference between each of the variables. These scales are generally used to depict non-mathematical ideas such as frequency, satisfaction, happiness, a degree of pain, etc. Example: How satisfied are you with our services? Very Unsatisfied – 1, Unsatisfied – 2, Neutral – 3, Satisfied – 4, Very Satisfied – 5
  • 74. Interval Scale: 3rd Level of Measurement The interval scale is defined as a numerical scale where the order of the variables is known as well as the difference between these variables. Variables that have familiar, constant and computable differences are classified using the interval scale. Examples: What is your family income? What is the temperature in your city?
  • 75. Ratio Scale: 4th Level of Measurement The ratio scale is defined as a variable measurement scale that not only produces the order of variables but also makes the difference between variables known, along with information on the value of a true zero. With the option of true zero, varied inferential and descriptive analysis techniques can be applied to the variables. Example: What is your weight in kilograms? Less than 50 kilograms, 51–70 kilograms, 71–90 kilograms, 91–110 kilograms, More than 110 kilograms
  • 76. VISUAL DATA EXPLORATION AND EXPLORATORY STATISTICAL ANALYSIS Visual data exploration is a very important part of getting to know your data in an "informal" way. It allows the analyst to get some initial insights into the data, which can then be usefully adopted throughout the modeling. Different plots/graphs can be useful here.
  • 77. Chart Types: Pie Chart A pie chart is a circular graph divided into slices. The larger a slice is, the bigger the portion of the total quantity it represents. A pie chart represents a variable's distribution as a pie, whereby each section represents the portion of the total percentage taken by each value of the variable. So, pie charts are best suited to depict sections of a whole. Example: if a company operates three separate divisions, at year-end its top management would be interested in seeing what portion of total revenue each division accounted for.
  • 78. Chart Types: Bar charts Bar charts represent the frequency of each of the values (either absolute or relative) as bars. A bar chart is composed of a series of bars illustrating a variable's development. Given that bar charts are such a common chart type, people are generally familiar with them and can understand them easily. A bar chart with one variable is easy to follow. Bar charts are great when we want to track the development of one or two variables over time. For example, one of the most frequent applications of bar charts in corporate presentations is to show how a company's total revenues have developed during a given period.
  • 79. Bar charts can also work well for comparing two variables over time. Let's say we would like to compare the revenues of two companies in the timeframe between 2014 and 2018, as sketched below.
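A minimal sketch of such a side-by-side bar comparison, assuming matplotlib and NumPy are available; the revenue figures and company names are invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical revenues (in millions) for two companies, 2014-2018
years = np.arange(2014, 2019)
revenue_a = [120, 135, 150, 170, 195]
revenue_b = [100, 110, 140, 160, 180]

width = 0.4  # bar width; the two companies' bars are drawn side by side per year
plt.bar(years - width / 2, revenue_a, width=width, label="Company A")
plt.bar(years + width / 2, revenue_b, width=width, label="Company B")
plt.xlabel("Year")
plt.ylabel("Revenue (millions)")
plt.title("Revenue comparison, 2014-2018")
plt.legend()
plt.show()
```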
  • 80. Chart Types: Histogram charts A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc. An example of a histogram, and the raw data it was constructed from, is shown below: 36 25 38 46 55 68 72 55 36 38 67 45 22 48 91 46 52 61 58 55
  • 81. To construct a histogram from a continuous variable you first need to split the data into intervals, called bins. In the example above, age has been split into bins, with each bin representing a 10-year period starting at 20 years. Each bin contains the number of occurrences of scores in the data set that are contained within that bin. For the above data set, the frequencies in each bin have been tabulated along with the scores that contributed to the frequency in each bin (see below):
  • 82.
  Bin       Frequency   Scores included in bin
  20–30     2           25, 22
  30–40     4           36, 38, 36, 38
  40–50     4           46, 45, 48, 46
  50–60     5           55, 55, 52, 58, 55
  60–70     3           68, 67, 61
  70–80     1           72
  80–90     0           –
  90–100    1           91
  Notice that, unlike a bar chart, there are no "gaps" between the bars (although some bars might be "absent", reflecting no frequencies). This is because a histogram represents a continuous data set, and as such, there are no gaps in the data (although you will have to decide whether you round up or round down scores on the boundaries of bins).
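A minimal sketch that reproduces this histogram from the slide's raw ages, assuming matplotlib is available.

```python
import matplotlib.pyplot as plt

# The 20 ages listed on the earlier slide
ages = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38, 67, 45,
        22, 48, 91, 46, 52, 61, 58, 55]

# Bins of width 10 starting at 20, matching the table above
bins = list(range(20, 101, 10))

plt.hist(ages, bins=bins, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Histogram of ages (bin width = 10 years)")
plt.show()
```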
  • 83. Chart Types: Scatter plots A scatter plot is a type of chart that is often used in the fields of statistics and data science. It consists of multiple data points plotted across two axes. Each variable depicted in a scatter plot would have multiple observations. If a scatter plot includes more than two variables, then we would use different colours to signify that. A scatter plot chart is a great indicator that allows us to see whether there is a pattern to be found between two variables.
  • 84. The x-axis contains information about house price, while the y-axis indicates house size. There is an obvious pattern to be found – a positive relationship between the two. The bigger a house is, the higher its price.
  • 85. Chart Types: Box plot Box plots (also known as box and whisker plots) are a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and skewness by displaying the data quartiles (or percentiles) and averages.
  • 86. Minimum Score The lowest score, excluding outliers (shown at the end of the left whisker). Lower Quartile Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile). Median The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.
  • 87. Upper Quartile Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of data are above this value. Maximum Score The highest score, excluding outliers (shown at the end of the right whisker). Whiskers The upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of scores and the upper 25% of scores). The Interquartile Range (or IQR) This is the box plot showing the middle 50% of scores (i.e., the range between the 25th and 75th percentile).
  • 88. Box plots divide the data into sections that each contain approximately 25% of the data in that set.
  • 89. Box plots are useful as they show the skewness of a data set (skewness is the degree of asymmetry observed in a probability distribution). Box plots are also useful as they show the central value of a data set: the median is the middle value of a set of data and is shown by the line that divides the box into two parts. Half the scores are greater than or equal to this value and half are less.
  • 90. When the median is in the middle of the box, and the whiskers are about the same on both sides of the box, then the distribution is symmetric. When the median is closer to the bottom of the box, and if the whisker is shorter on the lower end of the box, then the distribution is positively skewed (skewed right). When the median is closer to the top of the box, and if the whisker is shorter on the upper end of the box, then the distribution is negatively skewed (skewed left).
  • 91. Box plots are useful as they show outliers within a data set. An outlier is an observation that is numerically distant from the rest of the data. When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.
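A minimal sketch of a box plot that flags an outlier, assuming matplotlib; the scores are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical exam scores; 98 lies far beyond the upper whisker
# and will be drawn as a separate outlier point
scores = [52, 55, 56, 58, 60, 61, 61, 63, 65, 66, 68, 70, 98]

plt.boxplot(scores, vert=False)
plt.xlabel("Score")
plt.title("Box plot with one outlier")
plt.show()
```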
  • 92. MISSING VALUE TREATMENT Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model because we have not analysed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.
  • 93. There are various reasons for the occurrence of missing values. They may occur at two stages: 1. Data extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well. 2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
  • 94. 1. Missing completely at random: This is the case when the probability of a missing value is the same for all observations; data is missing independently of both observed and unobserved data. Example: a survey respondent randomly skips a question. For instance, respondents of a data collection process decide that they will declare their earnings after tossing a fair coin: if a head occurs, the respondent declares his/her earnings, and vice versa. Here each observation has an equal chance of a missing value. 2. Missing at random: This is the case when a variable is missing at random and the missing ratio varies for different values/levels of other input variables; the probability of missing data is related to the observed data but not to the missing data itself. Example: people with higher income are less likely to report their income, but we have other related variables like job title and education level.
  • 95. 3.Missing that depends on unobserved predictors: This is a case when the missing values are not random and are related to the unobserved input variable. The missingness is related to the unobserved data itself. For example: In a medical study, if a particular diagnostic causes discomfort, then there is higher chance of drop out from the study. This missing value is not at random unless we have included “discomfort” as an input variable for all patients. Example: People with very low incomes might be less likely to report their income because of stigma. 4.Missing that depends on the missing value itself: This is a case when the probability of missing value is directly correlated with missing value itself. For example: People with higher or lower income are likely to provide non-response to their earning.
  • 96. Which are the methods to treat missing values ? Some analytical techniques (e.g., decision trees) can directly deal with missing values. Other techniques need some additional preprocessing. The following are the most popular schemes to deal with missing values: 1. Deletion . This is the most straightforward option and consists of deleting observations or variables with lots of missing values. This, of course, assumes that information is missing at random and has no meaningful interpretation and/or relationship to the target.
  • 97. Deletion is of two types: list-wise deletion and pair-wise deletion. In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
  • 98. In pair-wise deletion, we perform each analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis; one of its disadvantages is that it uses different sample sizes for different variables. Deletion methods are used when the nature of missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output. A comparison of the two approaches is sketched below.
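A minimal sketch contrasting list-wise and pair-wise deletion, assuming pandas and NumPy; the small DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [40000, np.nan, 52000, 61000, 45000],
    "score":  [7, 6, 8, np.nan, 9],
})

# List-wise deletion: drop every observation that has any missing variable
listwise = df.dropna()
print(len(listwise))  # only the fully observed rows remain

# Pair-wise deletion: each analysis uses all cases where the variables involved
# are present, so different statistics may be based on different sample sizes
corr_age_income = df["age"].corr(df["income"])  # rows where both age and income exist
corr_age_score = df["age"].corr(df["score"])    # possibly a different set of rows
print(corr_age_income, corr_age_score)
```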
  • 99. 2. Replace (imputation ). This implies replacing the missing value with a known value. Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / Mode / Median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
  • 100. Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the observed values. Simple but can distort data distribution. K-Nearest Neighbors (KNN) Imputation: Replace missing values using the nearest neighbors' values. Effective but computationally expensive. Multiple Imputation: Create multiple imputations for the missing values and average the results. This approach preserves the uncertainty of the missing data. Regression Imputation: Use regression models to predict and fill in missing values based on other variables. Hot Deck Imputation: Replace missing values with values from similar records in the dataset. Machine Learning Models: Advanced models like Random Forests or Neural Networks can be used to predict missing values.
  • 101. Two Types of Imputation: Generalized imputation: we calculate the mean or median of all non-missing values of the variable and then replace the missing value with it. In the slide's example table, the variable "Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and replace the missing values with it. Similar case imputation: we calculate the average separately for gender "Male" (29.75) and "Female" (25) using the non-missing values, and then replace each missing value based on gender: for "Male" we replace missing Manpower values with 29.75, and for "Female" with 25.
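A minimal sketch of generalized versus similar-case imputation, assuming pandas and NumPy; the "Manpower" values below are invented for illustration, not the slide's table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Manpower": [30, 28, 25, np.nan, np.nan, 31],
})

# Generalized imputation: replace missing values with the overall mean
overall_mean = df["Manpower"].mean()
df["Manpower_generalized"] = df["Manpower"].fillna(overall_mean)

# Similar-case imputation: replace missing values with the mean of the same gender
df["Manpower_by_gender"] = df.groupby("Gender")["Manpower"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```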
  • 102. 3. Keep. Missing values can be meaningful (e.g., a customer did not disclose his or her income because he or she is currently unemployed). When this is clearly related to the target (e.g., good/bad risk), it needs to be considered as a separate category.