4. Ten Ways Data is Like Water | How to Leverage Data
• Companies cannot survive without data
• Companies can drown in too much data
• You can be surrounded by data that you can't use
• Data flows everywhere
• Data gets dirty and stale if left unattended
• Expensive data may not be better data
• Packaging matters
• Data management is a long-term project
• Data quality should be fit for purpose
• Clean data at the source
5. Business Drivers
1) The digitization of society;
2) The plummeting of technology costs;
3) Connectivity through cloud computing;
4) Increased knowledge about data science;
5) Social media applications;
6) The upcoming Internet-of-Things (IoT).
6. The digitization of society
• Big Data is largely consumer driven and consumer oriented.
• Most of the data in the world is generated by consumers, who are nowadays 'always-on'.
• Most people now spend 4-6 hours per day consuming and generating data through a variety of devices and (social) applications.
• With every click, swipe or message, new data is created in a database somewhere around the world. Because everyone now has a smartphone in their pocket, the data creation sums to incomprehensible amounts.
• Some studies estimate that 60% of data was generated within the last two years, which is a good indication of the rate at which society has digitized.
7. The plummeting of technology costs
• The costs of data storage and processors keep declining, making it possible for small businesses and individuals to become involved with Big Data.
• For storage capacity, the often-cited Moore's Law still holds: storage density (and therefore capacity) doubles roughly every two years.
8. Connectivity through cloud computing
• Cloud computing environments (where data is remotely stored in distributed storage systems) have made it possible to quickly scale IT infrastructure up or down and facilitate a pay-as-you-go model.
• This means that organizations that want to process massive quantities of data (and thus have large storage and processing requirements) do not have to invest in large quantities of IT infrastructure.
• Instead, they can license the storage and processing capacity they need and only pay for the amounts they actually use.
9. Increased knowledge about data science
• In the last decade, the terms data science and data scientist have become tremendously popular. In October 2012, Harvard Business Review called data scientist the "sexiest job of the 21st century", and many other publications have featured this new job role in recent years.
• The demand for data scientists (and similar job titles) has increased tremendously, and many people have actively become engaged in the domain of data science.
11. Social media applications
• Social media data provides insights into the behaviors, preferences and opinions of 'the public' on a scale that has never been known before.
• Because of this, it is immensely valuable to anyone who is able to derive meaning from these large quantities of data.
• Social media data can be used to identify customer preferences for product development, target new customers for future purchases, or even target potential voters in elections.
12. The upcoming Internet of Things (IoT)
• The Internet of Things (IoT) is the network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and network connectivity, which enables these objects to connect and exchange data.
13. Data is valuable
• Data is an asset with unique properties.
• The value of data can and should be expressed in economic terms.
(DAMA-DMBOK)
14. Data and Information
• The Strategic Alignment Model (Henderson and Venkatraman, 1999) abstracts the fundamental drivers for any approach to data management.
• At its center is the relationship between data and information.
• Information is most often associated with business strategy and the operational use of data.
15. Data Governance and Data Management
• Data Governance (DG) is defined as the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.
• All organizations make decisions about data, regardless of whether they have a formal Data Governance function.
• The Data Governance function guides all other data management functions.
• The purpose of Data Governance is to ensure that data is managed properly, according to policies and best practices (Ladley, 2012).
Data Management Requirements are Business Requirements
• Managing data means managing the quality of data
• It takes metadata to manage data
• It takes planning to manage data
• Data Management requirements must drive Information Technology decisions
Data Management depends on diverse skills
• Data Management is cross-functional
• Data management requires an enterprise perspective
• Data management must account for a range of perspectives
Data Management is lifecycle management
• Different types of data have different lifecycle characteristics
• Managing data includes managing the risks associated with data
19. Database administrator
A database administrator implements and manages the operational aspects of cloud-native and hybrid data platform solutions that are built on Database Server.
21. Business Analyst
While some similarities exist between a data analyst and a business analyst, the key differentiator between the two roles is what they do with data.
1. A business analyst is closer to the business and is a specialist in interpreting the data that comes from the visualization.
2. Often, the roles of data analyst and business analyst could be the responsibility of a single person.
22. Data analyst
A data analyst enables businesses to maximize the value of their data assets through visualization and reporting tools such as Microsoft Power BI.
Data analysts work with data engineers to determine and locate appropriate data sources that meet stakeholder requirements.
25. Data engineer
1. Data engineers provision and set up data platform technologies that are on-premises and in the cloud.
2. They manage and secure the flow of structured and unstructured data from multiple sources.
3. The data platforms that they use can include relational databases, nonrelational databases, data streams, and file stores.
4. Data engineers also ensure that data services securely and seamlessly integrate across data platforms.
27. Data scientist
1. Data scientists perform advanced analytics to extract value from data.
2. Their work can vary from descriptive analytics to predictive analytics.
3. Descriptive analytics evaluate data through a process known as exploratory data analysis (EDA).
4. Predictive analytics are used in machine learning to apply modeling techniques that can detect anomalies or patterns.
5. These analytics are important parts of forecast models.
29. Machine Learning and Data Science
[Diagram: the overlapping fields of Artificial Intelligence, Machine Learning, Deep Learning, and Data Science]
30. The Data Science Process
Frame the Problem → Identify & Collect Data → Process Data → Analyze Data → Train Models → Finalize the Project
31. Problem Formulation
• "Understandable": can be defined in terms of business needs.
• "Actionable": can offer high-level direction as to how to approach a solution.
• Frame the problem.
  • Description of problem, written clearly so it can be handed off to others.
• Identify why the problem must be solved.
  • Rationale
  • Benefits
  • Lifetime and use
• Provide background information.
  • Assumptions (e.g., acceptable data, operating requirements, business contexts, etc.).
  • Reference problems (i.e., similar problems you've solved before).
• Determine whether the problem is appropriate for data science.
  • Some problems are more easily solved using traditional methods.
  • Data science can be difficult and expensive.
  • Data science must be justified as the optimal approach.
Problem formulation: The process of identifying an issue that should be addressed, and putting that issue in terms that are understandable and actionable.
32. Identify & Collect Data
Frame the Problem → [Identify & Collect Data] → Process Data → Analyze Data → Train Models → Finalize the Project
33. Datasets
• Data does not always start out in a neatly packaged form.
  • Individual pieces might span multiple repositories.
  • Data might be mixed in with irrelevant or dissimilar data.
  • You'll need to place data into one or more sets.
• Example: Data repository with information about salespeople.
  • Data describing the same ideas is already in one place.
  • May be considered a dataset.
• Example: Database A has customer demographics, B has actual transaction info.
  • Data is spread out and takes different forms.
  • Will be difficult to work with as is.
  • Must be placed into one or more sets.
• Datasets can include any kind of data that's relevant to your goals.
  • Might be unique to your industry/organization.
Dataset: A collection of data that can be used to accomplish business goals.
34. Structure of Data
• Structured:
  • Facilitates searching, filtering, or extracting data.
  • E.g., spreadsheet or database.
  • Chunks of data can be retrieved using a programming or querying language.
• Unstructured:
  • Not easy to query.
  • E.g., images, video, textual contents, etc.
  • Usually a larger proportion of data than structured data.
• Semi-structured:
  • Aspects of both structured and unstructured.
  • E.g., email content is unstructured, but email fields are structured.
  • Some formats (like XML and JSON documents) can be in different forms.
    • Server log output could be structured.
    • Human-authored documents may not be structured.
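As a small illustration of semi-structured data, the sketch below parses a server log line whose JSON envelope is structured while its message field holds free text. The log format and field names here are invented for illustration, not taken from any real system:

```python
import json

# A semi-structured server log line (hypothetical format): the JSON
# fields are structured, but "message" holds free, unstructured text.
log_line = ('{"timestamp": "2024-05-01T12:00:00Z", '
            '"level": "ERROR", '
            '"message": "disk 3 reported 12 bad sectors"}')

record = json.loads(log_line)   # parse the structured envelope

print(record["level"])          # structured field -> easy to query
print(record["message"])        # free text -> needs further processing
```

The structured fields can be filtered with ordinary code or a query language; extracting meaning from the message text would require additional text processing.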
36. Preliminary Data Transformation
• Transformation comes after extraction in ETL (extract, transform, load).
  • Involves changing data in some way.
• You can make changes early.
  • You know what to look for.
  • Your tools can help you find issues.
• However, you won't be able to make some changes until after analysis.
  • Analysis tells you what to transform and how.
• Changes at this point are preliminary.
  • You can still get quite a bit done.
37. Data Preparation and Cleaning
• Data preparation alters data so it more effectively supports data science tasks.
  • Tasks like analysis and modeling.
  • Necessary for achieving business goals.
  • Comprises multiple individual tasks.
  • Purpose is to identify issues before data is loaded into its destination.
  • Issues can be at a macro level or micro level.
• Data cleaning addresses inaccuracies and other problems with data.
  • A subset of preparation.
  • Duplicated data, poorly formatted data, corrupt data, etc.
  • You can correct data or remove it.
  • Choice of action depends on feasibility and impact on later processes.
• Data wrangling/munging are alternative terms.
  • Often refer to manual work.
  • Preparation can be automated so cleanup can repeat.
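A minimal sketch of these cleaning steps using pandas, on an invented toy table (the column names and values are illustrative, not from the slides). It handles the three issue types named above: duplicated data, poorly formatted data, and corrupt data:

```python
import pandas as pd

# Toy customer table with typical quality issues (illustrative data):
# a duplicated row, inconsistent name formatting, and a corrupt amount.
df = pd.DataFrame({
    "name":   [" alice ", "BOB", "BOB", "carol"],
    "amount": ["10.5", "7", "7", "n/a"],
})

df = df.drop_duplicates()                        # duplicated data
df["name"] = df["name"].str.strip().str.title()  # poorly formatted data
df["amount"] = pd.to_numeric(df["amount"],
                             errors="coerce")    # corrupt data -> NaN
df = df.dropna(subset=["amount"])                # remove what can't be repaired

print(df)
```

Because the steps are plain code rather than manual edits, the cleanup can be rerun automatically whenever new data arrives, as the last bullet suggests.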
39. Exploratory Data Analysis
• Purpose is to maximize insights gleaned from the data.
• Objectives:
  • Test and evaluate prior assumptions.
  • Reveal underlying data structure.
  • Determine important features/factors.
  • Identify unwanted elements.
  • Determine best path forward.
• EDA is flexible and often employs visualizations.
  • Enables more open-ended exploration.
  • Incorporates plots of raw data and summarized data.
  • Arranging multiple plots can make it easier to recognize patterns.
• EDA is valuable at every step of the process.
  • Changing the data or applying it could prompt EDA.
Exploratory data analysis: A data science approach to closely examining data in order to reveal new information.
40. Dataset Content and Format
• Start by getting familiar with content and format of data.
• Try to identify:
  • Number of columns
  • Names of columns
  • Data types of columns
  • Number of rows
  • Primary row identifiers
  • Value representation
  • Presence/number of missing values
• Use Python DataFrame functions:
  • info() to get attributes.
  • head() to get first few rows.
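A short sketch of this first look using pandas. The dataset here is invented for illustration; in practice you would load yours with something like `pd.read_csv()`:

```python
import pandas as pd

# Small illustrative dataset with one missing value.
df = pd.DataFrame({
    "id":    [1, 2, 3, 4],
    "city":  ["Oslo", "Lima", None, "Pune"],
    "sales": [250.0, 310.5, 120.0, 90.25],
})

df.info()               # column names, dtypes, non-null counts, row count
print(df.head(2))       # first few rows, to check value representation
print(df.isna().sum())  # number of missing values per column
```

Together these calls answer most of the checklist above: column count and names, data types, row count, and the presence of missing values.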
41. Correlation Coefficient
• Produces a value between +1 (positive correlation) and −1 (negative correlation).
• Shows the strength of the variables' dependence on each other.
Pearson correlation coefficient: A measurement of the linear correlation between two variables, commonly called x and y.
[Figure: three scatter plots showing positive correlation (r = 0.6), no correlation (r = 0), and negative correlation (r = −0.8)]
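The coefficient can be computed directly from its definition (covariance divided by the product of standard deviations). This pure-Python sketch, with a helper function name of our own choosing, illustrates the two extremes of the range:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ +1.0 (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ −1.0 (perfect negative)
```

Intermediate values like the r = 0.6 and r = −0.8 in the figure indicate weaker linear relationships; r = 0 means no linear dependence at all.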
58. Heatmaps
[Figure: heatmap superimposed on a geographical map; areas with fewer houses are shown in purple, areas with more houses in green]
[Figure: correlation matrix shown as a heatmap; data pairs with lower correlation appear in lighter shades, those with higher correlation in darker shades]
59. You Are Here (Process, Analyze, Train Models)
Frame the Problem → Identify & Collect Data → [Process Data → Analyze Data → Train Models] → Finalize the Project
61. The Bias–Variance Tradeoff
• High bias:
  • May underfit the training set
  • More simplistic
  • Less likely to be influenced by true relationships between features and target outputs
• The sweet spot:
  • Good enough fit on training datasets
  • Just complex enough
  • Skillful in finding true relationships between features and target outputs while not overly influenced by noise
• High variance:
  • May overfit the training set
  • More complex
  • More likely to be influenced by false relationships between features and target outputs ("noise")
[Figure: error plotted against model complexity, with the sweet spot at the minimum of the error curve]
71. Naïve Bayes
Bayes' theorem, used by naïve Bayes classifiers for class probability estimation:

p(y|x) = p(x|y) p(y) / p(x)

Where:
• y is the observed classification.
• x is the vector of dataset features.
• p(y|x) is the likelihood of y given x (posterior probability).
• p(x|y) is the likelihood of x given y.
• p(y) is the probability of y independent of the data (prior probability).
• p(x) is the probability of x independent of the data.
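A quick numeric sketch of the theorem. The probabilities below are invented for illustration (a hypothetical drive-failure alert), not taken from the slides:

```python
# Bayes' theorem with illustrative numbers:
# y = "drive fails", x = "sensor raises an alert".
p_y = 0.01              # prior: 1% of drives fail
p_x_given_y = 0.90      # alert fires for 90% of failing drives
p_x_given_not_y = 0.10  # false-alert rate on healthy drives

# p(x) via the law of total probability.
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Posterior: p(y|x) = p(x|y) p(y) / p(x)
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 4))  # 0.0833
```

Even with a sensitive alert, the low prior keeps the posterior at roughly 8%: an alert raises suspicion but does not by itself mean the drive is likely to fail.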
79. Accuracy
• Drive failure model accuracy: ~98%.
• Intuitive, but often unreliable.
  • You can have a high accuracy even if the model doesn't excel at its purpose.
  • Only really suitable in balanced datasets.
Accuracy = correct estimations / all estimations = (TP + TN) / (TP + TN + FP + FN)
80. Precision
• Drive failure model precision: 68%.
• More useful than accuracy in unbalanced datasets.
• Doesn't account for false negatives.
  • Just one drive malfunctioning is undesirable.
  • You could set your tolerance for false negatives higher, but precision will still come up short.
Precision = correct positive estimations / all positive estimations = TP / (TP + FP)
81. Recall
• Drive failure model recall: 81%.
• Minimizes false negatives.
  • You could predict all drives will fail, making recall 100%, but the model would be useless.
• Not as good as precision at minimizing false positives.
Recall = correct positive estimations / all relevant instances = TP / (TP + FN)
83. F₁ Score
• Precision and recall are more useful in unbalanced datasets.
  • They come with a tradeoff.
• Not always clear which metric is more useful.
  • A false positive may be just as undesirable as a false negative.
• F₁ score helps you find the optimal combination of both precision and recall.
F₁ = 2 · (precision · recall) / (precision + recall)
Example: F₁ = 2 · (0.87 · 0.79) / (0.87 + 0.79) ≈ 0.83, an F₁ score of around 83%.
84. Specificity
• Drive failure model specificity: 98%.
• Maximizes true negatives.
• Not useful in all cases, especially with imbalanced datasets.
• Customer attrition scenario based on satisfaction with a new product:
  • Satisfaction is positive, lack of satisfaction is negative.
  • Responses are balanced.
  • Maximize true negatives to reduce attrition from unsatisfied customers.
  • Might be a good case for specificity.
Specificity = correct negative estimations / all actual negatives = TN / (TN + FP)
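The formulas on the preceding slides can be collected into one helper. The confusion-matrix counts below are invented so that the resulting metrics land near the slides' drive-failure values (roughly 97-98% accuracy, 68% precision, 81% recall, 98% specificity):

```python
def classification_metrics(tp, tn, fp, fn):
    """Common evaluation metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   precision,
        "recall":      recall,
        "f1":          2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# Illustrative confusion matrix for a drive-failure model.
m = classification_metrics(tp=81, tn=1862, fp=38, fn=19)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Note how the heavy class imbalance (1,900 healthy drives vs. 100 failing ones) lets accuracy and specificity look excellent while precision stays modest, which is exactly why the slides recommend precision and recall for unbalanced data.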
85. Receiver Operating Characteristic (ROC) Curve
[Figure: ROC curves for Model A and Model B, plotting true positive rate (0.0-1.0) against false positive rate (0.0-1.0), with the diagonal line representing a random guess]
False Positive Rate = FP / (FP + TN)
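One point on an ROC curve comes from fixing a classification threshold and counting outcomes; sweeping the threshold traces the full curve. This pure-Python sketch (scores and labels invented for illustration) applies the false-positive-rate formula above together with the true positive rate (recall):

```python
def roc_point(scores, labels, threshold):
    """(FPR, TPR) for one threshold: a single point on the ROC curve."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Illustrative model scores and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

print(roc_point(scores, labels, 0.5))  # one (FPR, TPR) point
```

At threshold 0.0 every example is predicted positive, giving the curve's top-right corner (1.0, 1.0); raising the threshold moves the point toward the bottom-left.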
87. Know Your Audience
87
• Your audience might be:
• Just you, no reporting needed
• A single person
• A small group of stakeholders
• An entire organization
• You may need to adjust your reporting for:
• Different knowledge
• Different needs
• Different expectations
88. Derive Insights from Findings
• Findings must be translated into business insights to demonstrate their value.
• Begin by reviewing the overall process and results.
• Ask yourself:
  • What did I know before the project?
  • What do I know now?
  • How does my analysis supplement my knowledge?
  • How do the models I built address issues?
  • How do the results align with KPIs?
  • What business actions can be taken?
  • How can I improve the data science process in the future?
• Ensure insights are both relevant and in context.
  • E.g., customers care less about insights into increasing profits than insights into improving the user experience.
• Ensure insights are clear and precise.
  • E.g., a classifier "is 95% accurate and will save 20 work hours in a week as compared to current manual review."
89. Explainability
• Explainability/interpretability is one factor that drives your conclusions.
  • An explainable process is one whose inner workings are identifiable and can be communicated.
• Often, you must be able to explain why a model produced a result.
  • Proves the model's skill.
  • Makes decisions more defensible.
  • Allays concerns people have about automation.
• Some algorithms are "black boxes" and can't be easily interpreted.
  • E.g., neural networks.
• Many algorithms are explainable, however.
  • There are several ways to explain them.