Data Exploration
Dr. Windu Gata, M.Kom
Education and career timeline: SDN 5 Pagi Pondok Pinang; SMP 31 PGRI; SMP 178 Rempoa; SMA 86; milestones from 1991 through 2022.
Dr. Windu Gata, M.Kom
Internal Trainer, Multimatics-Karya Talents (2023); IT Consultant since 1995; Lecturer since 2003; Computer Researcher since 2008
• Water gives life
• Water is abundant
• Water is purified
• Water is distributed
• Water is democratic
• Water is fresh
• Water is human
https://medium.com/citizenme/data-is-the-new-water-seven-reasons-why-45511bc5b9bd
Water cycle - Wikipedia
• Companies cannot survive without data
• Companies can drown in too much data
• You can be surrounded by data that you can’t use
• Data flows everywhere
• Data gets dirty and stale if left unattended
• Expensive data may not be better data
• Packaging matters
• Data management is a long-term project
• Data quality should be fit for purpose
• Clean data at the source
Ten Ways Data is Like Water | How to Leverage Data
Business Driver
1) The digitization of society;
2) The plummeting of technology costs;
3) Connectivity through cloud computing;
4) Increased knowledge about data science;
5) Social media applications;
6) The upcoming Internet-of-Things (IoT).
The digitization of society
• Big Data is largely consumer driven and consumer
oriented.
• Most of the data in the world is generated by
consumers, who are nowadays ʻalways-onʼ.
• Most people now spend 4-6 hours per day consuming
and generating data through a variety of devices and
(social) applications.
• With every click, swipe or message, new data is
created in a database somewhere around the world.
Because everyone now has a smartphone in their
pocket, the data creation sums to incomprehensible
amounts.
• Some studies estimate that 60% of data was
generated within the last two years, which is a good
indication of the rate with which society has digitized.
The plummeting of technology costs
• The costs of data storage and processors keep declining, making it possible for small businesses and individuals to become involved with Big Data.
• For storage capacity, the often-cited Moore’s Law still holds: storage density (and therefore capacity) doubles roughly every two years.
Connectivity through cloud computing
• Cloud computing environments (where data is remotely stored in distributed storage systems) have made it possible to quickly scale IT infrastructure up or down and facilitate a pay-as-you-go model.
• This means that organizations that want to process massive quantities of data (and thus have large storage and processing requirements) do not have to invest in large quantities of IT infrastructure.
• Instead, they can license the storage and processing capacity they need and pay only for the amounts they actually use.
Increased knowledge about data science
• In the last decade, the terms data science and data scientist have become tremendously popular. In October 2012, Harvard Business Review called data scientist the “sexiest job of the 21st century,” and many other publications have featured this new job role in recent years.
• The demand for data scientists (and similar job titles) has increased tremendously, and many people have become actively engaged in the domain of data science.
Interest by Region
Social media applications
Social media data provides insights into the behaviors, preferences and opinions of ‘the public’ on a scale that has never been known before.
Because of this, it is immensely valuable to anyone who is able to derive meaning from these large quantities of data.
Social media data can be used to identify customer preferences for product development, target new customers for future purchases, or even target potential voters in elections.
The upcoming Internet-of-Things (IoT).
• The Internet of Things (IoT) is the network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and network connectivity, which enables these objects to connect and exchange data.
Data is valuable
• Data is an asset with unique properties.
• The value of data can and should be expressed in economic terms.
DAMA-DMBOK
Data and Information
• The Strategic Alignment Model (Henderson and Venkatraman, 1999) abstracts the fundamental drivers for any approach to data management.
• At its center is the relationship between data and information.
• Information is most often associated with business strategy and the operational use of data.
Data Governance and Data Management
• Data Governance (DG) is defined as the exercise of authority and control
(planning, monitoring, and enforcement) over the management of data
assets.
• All organizations make decisions about data, regardless of whether they have
a formal Data Governance function.
• The Data Governance function guides all other data management functions.
• The purpose of Data Governance is to ensure that data is managed properly,
according to policies and best practices (Ladley, 2012)
Data Management Requirements are Business Requirements
• Managing data means managing the quality of data.
• It takes Metadata to manage data.
• It takes planning to manage data.
• Data Management requirements must drive Information Technology decisions.
Data Management depends on diverse skills
• Data Management is cross-functional.
• Data management requires an enterprise perspective.
• Data management must account for a range of perspectives.
Data Management is lifecycle management
• Different types of data have different
lifecycle characteristics
• Managing data includes managing the
risks associated with data
https://gambarpesona.blogspot.com/
Roles (Career)
Database administrator
A database administrator implements and manages the operational aspects of cloud-native and hybrid data platform solutions that are built on database servers.
Business Analyst
While some similarities exist between a
data analyst and business analyst, the
key differentiator between the two
roles is what they do with data.
1. A business analyst is closer to
the business and is a specialist
in interpreting the data that
comes from the visualization.
2. Often, the roles of data analyst
and business analyst could be
the responsibility of a single
person.
Data analyst
A data analyst enables businesses to maximize the value of their data assets through visualization and reporting tools such as Microsoft Power BI.
Data analysts work with data engineers to determine and locate appropriate data sources that meet stakeholder requirements
Data Visualization
Data engineer
1. Data engineers provision and set up data platform technologies that are on-premises and in the cloud.
2. They manage and secure the flow of structured and unstructured data from multiple sources.
3. The data platforms that they use can include relational databases, nonrelational databases, data streams, and file stores.
4. Data engineers also ensure that data services securely and seamlessly integrate across data platforms.
Data Engineer Skill
Data scientist
1. Data scientists perform advanced analytics to extract value from data.
2. Their work can vary from descriptive analytics to predictive analytics.
3. Descriptive analytics evaluate data through a process known as exploratory data analysis (EDA).
4. Predictive analytics are used in machine learning to apply modeling techniques that can detect anomalies or patterns.
5. These analytics are important parts of forecast models.
Artificial Intelligence
Machine Learning and Data Science
(Venn diagram: Artificial Intelligence, Machine Learning, Deep Learning, and Data Science.)
The Data Science Process
Frame the Problem → Identify & Collect Data → Process Data → Analyze Data → Train Models → Finalize the Project
• "Understandable": can be defined in terms of business needs.
• "Actionable" can offer high-level direction as to how to approach a solution.
• Frame the problem.
• Description of problem, written clearly so it can be handed off to others.
• Identify why the problem must be solved.
• Rationale
• Benefits
• Lifetime and use
• Provide background information.
• Assumptions (e.g., acceptable data, operating requirements, business contexts, etc.).
• Reference problems (i.e., similar problems you've solved before).
• Determine whether the problem is appropriate for data science.
• Some problems are more easily solved using traditional methods.
• Data science can be difficult and expensive.
• Data science must be justified as the optimal approach.
Problem Formulation
Problem formulation: The process of identifying an issue that should be addressed,
and putting that issue in terms that are understandable and actionable.
Identify & Collect
• Data does not always start out in a neatly packaged form.
• Individual pieces might span multiple repositories.
• Data might be mixed in with irrelevant or dissimilar data.
• You'll need to place data with one or more sets.
• Example: Data repository with information about salespeople.
• Data describing the same ideas is already in one place.
• May be considered a dataset.
• Example: Database A has customer demographics, B has actual transaction info.
• Data is spread out and takes different forms.
• Will be difficult to work with as is.
• Must be placed into one or more sets.
• Datasets can include any kind of data that's relevant to your goals.
• Might be unique to your industry/organization.
Datasets
Dataset: A collection of data that can be used to accomplish business goals.
• Structured:
• Facilitates searching, filtering, or extracting data.
• E.g., spreadsheet or database.
• Chunks of data can be retrieved using a programming or querying language.
• Unstructured:
• Not easy to query.
• E.g., images, video, textual contents, etc.
• Usually a larger proportion of data than structured data.
• Semi-structured:
• Aspects of both structured and unstructured.
• E.g., email content is unstructured, but email fields are structured.
• Some formats (like XML and JSON documents) can be in different forms.
• Server log output could be structured.
• Human-authored documents may not be structured.
Structure of Data
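The email example above can be sketched in code. This is a minimal illustration (the message contents are made up) of why semi-structured data is convenient: the structured fields of a JSON document can be queried directly, while the body remains free text.

```python
import json

# An email represented as semi-structured data: the fields (from,
# subject, date) are structured, while the body is unstructured text.
raw = '''
{
  "from": "alice@example.com",
  "subject": "Q3 sales figures",
  "date": "2023-10-02",
  "body": "Hi team, attached are the latest numbers. Thanks!"
}
'''

email = json.loads(raw)

# Structured fields can be retrieved by name, like a database column...
print(email["from"])                        # alice@example.com
# ...while the unstructured body needs text processing to query.
print("sales" in email["subject"].lower())  # True
```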
Process Data
Preliminary Data Transformation
• Transformation comes after extraction in ETL.
• Involves changing data in some way.
• You can make changes early.
• You know what to look for.
• Your tools can help you find issues.
• However, you won't be able to make some changes until after analysis.
• Tells you what to transform and how.
• Changes at this point are preliminary.
• You can still get quite a bit done.
• Data preparation alters data so it more effectively supports data science tasks.
• Tasks like analysis and modeling.
• Necessary for achieving business goals.
• Comprises multiple individual tasks.
• Purpose is to identify issues before data is loaded into its destination.
• Issues can be at a macro level or micro level.
• Data cleaning addresses inaccuracies and other problems with data.
• Subset of preparation.
• Duplicated data, poorly formatted data, corrupt data, etc.
• You can correct data or remove it.
• Choice of action depends on feasibility and impact on later processes.
• Data wrangling/munging are alternative terms.
• Often refers to manual work.
• Preparation can be automated so cleanup can repeat.
Data Preparation and Cleaning
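The cleaning tasks listed above (duplicated data, poorly formatted data, corrupt data) can be sketched as a small, repeatable routine. The records and rules here are hypothetical; a real pipeline would externalize the rules so the cleanup can be automated and repeated.

```python
# Hypothetical raw records with the three problem types from the slide.
raw_rows = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": "29"},
    {"name": " Alice ", "age": "34"},   # duplicated data
    {"name": "Carol", "age": "n/a"},    # corrupt value
]

cleaned, seen = [], set()
for row in raw_rows:
    name = row["name"].strip().title()  # fix poor formatting
    if not row["age"].isdigit():        # drop corrupt data
        continue
    key = (name, row["age"])
    if key in seen:                     # drop duplicates
        continue
    seen.add(key)
    cleaned.append({"name": name, "age": int(row["age"])})

print(cleaned)  # [{'name': 'Alice', 'age': 34}, {'name': 'Bob', 'age': 29}]
```

Whether to correct or remove a bad row depends, as the slide says, on feasibility and on what later processing can tolerate; here corrupt rows are simply dropped.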
Analyze
• Purpose is to maximize insights gleaned from the data.
• Objectives:
• Test and evaluate prior assumptions.
• Reveal underlying data structure.
• Determine important features/factors.
• Identify unwanted elements.
• Determine best path forward.
• EDA is flexible and often employs visualizations.
• Enables more open-ended exploration.
• Incorporates plots of raw data and summarized data.
• Arranging multiple plots can make it easier to recognize patterns.
• EDA is valuable at every step of the process.
• Changing the data or applying it could prompt EDA.
Exploratory Data Analysis
Exploratory data analysis: A data science approach to closely examining data in order to reveal
new information.
• Start by getting familiar with content and format of data.
• Try to identify:
• Number of columns
• Names of columns
• Data types of columns
• Number of rows
• Primary row identifiers
• Value representation
• Presence/number of missing values
• Use Python DataFrame functions:
• info() to summarize columns, data types, and non-null counts.
• head() to preview the first few rows.
Dataset Content and Format
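A minimal first-look sketch using pandas (the dataset below is made up for illustration): info() answers most of the checklist above (columns, dtypes, row count, missing values), and head() shows how values are represented.

```python
import pandas as pd

# A small, hypothetical sales dataset for first-look inspection.
df = pd.DataFrame({
    "salesperson_id": [101, 102, 103],
    "region": ["North", "South", "North"],
    "revenue": [25000.0, None, 31000.0],   # one missing value
})

df.info()               # column names, dtypes, non-null counts, row count
print(df.head())        # first few rows: how values are represented
print(df.shape)         # (number of rows, number of columns)
print(df.isna().sum())  # missing values per column
```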
• Produces a value between +1 (positive correlation) and −1 (negative correlation).
• Shows the strength of the linear dependence between the two variables.
Correlation Coefficient
Pearson correlation coefficient: A measurement of the linear correlation between two variables, commonly called x and y.
(Scatter plots: positive correlation, r = 0.6; no correlation, r = 0; negative correlation, r = −0.8.)
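The coefficient can be computed directly from its definition (covariance divided by the product of the standard deviations); the sample sequences below are made up to show the +1 and −1 extremes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relationship gives r = +1 ...
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
# ... and a perfectly inverse one gives r = -1.
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```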
Correlation Strength
(Scatter plots: strong positive, strong negative, weak positive, and weak negative correlation.)
Frequency Distribution (Fruit Example)
(Bar chart: frequency, 0–100, for Apples, Bananas, Grapes, Oranges, Pears.)
Frequency Distribution (Height Example)
(Histogram: height in inches vs. frequency, 0–100.)
Probability Distribution (Fruit Example)
(Bar chart: probability, 0–0.4, for Apples, Bananas, Grapes, Oranges, Pears.)
Probability Distribution (Height Example)
(Histogram: height in inches vs. probability, 0–0.2.)
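The link between the two kinds of chart: a probability distribution is a frequency distribution with each count divided by the total number of observations. The fruit counts below are hypothetical.

```python
# Hypothetical fruit counts (a frequency distribution).
counts = {"Apples": 80, "Bananas": 40, "Grapes": 60, "Oranges": 10, "Pears": 10}

# Divide each count by the total to get a probability distribution.
total = sum(counts.values())
probs = {fruit: c / total for fruit, c in counts.items()}

print(probs["Apples"])       # 0.4
print(sum(probs.values()))   # 1.0 -- probabilities always sum to 1
```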
Normal Distribution
• Bell shaped
• Symmetrical
• Centered
• Unimodal
(Plot: frequency or probability vs. variable values.)
Normal Distribution (Height Example)
(Bell curve: height in inches vs. probability, 0–0.2.)
Non-Normal Distributions
• Skewed distributions
• Multi-modal distributions
Standard Deviation Comparison
(Two normal curves with mean 67″: one with a standard deviation of 5, one with a standard deviation of 20.)
Standard Deviations in a Normal Distribution (Height Example)
With a mean of 67″: about 68% of values fall within ±1 standard deviation (57″–77″), 95% within ±2 (47″–87″), and 99.7% within ±3 (37″–97″).
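The 68–95–99.7 bands can be verified with the standard library. The height example's bands (57″–77″ at ±1, 47″–87″ at ±2, 37″–97″ at ±3) imply a mean of 67 inches and a standard deviation of 10.

```python
from statistics import NormalDist

# Height example: mean 67 inches, standard deviation 10 inches.
heights = NormalDist(mu=67, sigma=10)

within_1sd = heights.cdf(77) - heights.cdf(57)
within_2sd = heights.cdf(87) - heights.cdf(47)
within_3sd = heights.cdf(97) - heights.cdf(37)

print(round(within_1sd, 3))  # 0.683 -> the 68% band
print(round(within_2sd, 3))  # 0.954 -> the 95% band
print(round(within_3sd, 3))  # 0.997 -> the 99.7% band
```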
Skewness
(Three curves: a symmetrical distribution, where mean, median, and mode coincide; a positively skewed distribution, where the mean is pulled to the right of the median and mode; and a negatively skewed distribution, where the mean is pulled to the left.)
Box Plots
(Plot showing the minimum, Q1, median, Q3, maximum, and any outliers.)
Violin Plot
(Width indicates probability density: wide regions are high probability, narrow regions low; a line marks the median.)
Line Plots
(Data points connected by a trend line.)
Area Plots
(Trend areas filled under the lines.)
Geographical Maps
(Map distinguishing the most expensive homes from less expensive homes by location.)
Heatmaps
• Heatmap superimposed on a geographical map: areas with fewer houses are shown in purple, areas with more houses in green.
• Correlation matrix shown as a heatmap: data pairs with lower correlation appear in lighter shades, higher correlation in darker shades.
You Are Here (Process, Analyze, Train Models)
Machine Learning Algorithms
The Bias–Variance Tradeoff
• High bias:
• May underfit the training set
• More simplistic
• Less likely to be influenced by true relationships
between features and target outputs
• The sweet spot:
• Good enough fit on training datasets
• Just complex enough
• Skillful in finding true relationships between
features and target outputs while not overly
influenced by noise
• High variance:
• May overfit the training set
• More complex
• More likely to be influenced by false relationships
between features and target outputs ("noise")
(Plot: error vs. model complexity; the sweet spot sits between high bias and high variance.)
Holdout Method
The original data is divided into holdout sets: a training set, which the learning algorithm uses to build the predictive model, plus a validation set and a test set.
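A minimal holdout split sketch. The 70/15/15 ratios are assumptions for illustration; the important properties are that the sets are disjoint and together cover the original data.

```python
import random

# Hypothetical dataset and split ratios: 70% train, 15% validation, 15% test.
data = list(range(100))
random.Random(42).shuffle(data)   # fixed seed so the split is reproducible

n = len(data)
train = data[: int(n * 0.70)]
validation = data[int(n * 0.70): int(n * 0.85)]
test = data[int(n * 0.85):]

print(len(train), len(validation), len(test))   # 70 15 15
# The three holdout sets partition the original data: nothing lost, no overlap.
print(set(train) | set(validation) | set(test) == set(data))   # True
```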
k-Means Clustering (Slide 1 of 2)
(Scatter plot, axes 0.0–1.0: points grouped into clusters around their centroids.)
K-Means Samples
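A minimal one-dimensional k-means sketch (data and starting centroids are made up) showing the two alternating steps: assign each point to its nearest centroid, then move each centroid to the mean of its cluster.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Tiny 1-D k-means: returns final centroids and their clusters."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups near 0.15 and 0.85, with k = 2.
points = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 1.0])
print(centroids)   # [0.15, 0.85]
```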
Linear Regression
(Plot: a fitted straight line relating the independent variable to the dependent variable.)
Linear Regression Sample
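The fitted line can be computed in closed form with ordinary least squares for a single feature; the data points below are made up to lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Independent variable x, dependent variable y (points on y = 2x + 1).
xs = [0, 10, 20, 30, 40, 50]
ys = [1, 21, 41, 61, 81, 101]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)   # 2.0 1.0
```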
Support-Vector Machines (SVMs)
(Plot, axes 0.0–2.0: Class 0 and Class 1 separated by a decision boundary; the support vectors are the points nearest the boundary and define the support-vector margins on either side.)
SVM Samples
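A sketch of the geometry behind an SVM decision boundary. The weights and bias here are chosen by hand for illustration, not learned by a solver: points are classified by the sign of w·x + b, and their distance to the boundary is |w·x + b| / ‖w‖.

```python
from math import sqrt

w = (1.0, 1.0)   # hypothetical weight vector
b = -2.0         # hypothetical bias: the boundary is x1 + x2 = 2

def classify(x):
    score = w[0] * x[0] + w[1] * x[1] + b
    return 1 if score >= 0 else 0

def distance_to_boundary(x):
    score = w[0] * x[0] + w[1] * x[1] + b
    return abs(score) / sqrt(w[0] ** 2 + w[1] ** 2)

print(classify((0.5, 0.5)))                         # 0 (below the boundary)
print(classify((1.5, 1.5)))                         # 1 (above the boundary)
print(distance_to_boundary((1.0, 1.0)))             # 0.0 -- on the boundary
```

A real SVM chooses w and b so that the margin (the distance from the boundary to the nearest points, the support vectors) is as wide as possible.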
Customer Retention Example Tree
Satisfied <= 0.5 | Samples = 8 | T: 3, F: 5 | Class: Returning
├─ Customer age <= 0.5 | Samples = 5 | T: 1, F: 4 | Class: Returning
│  ├─ Initial purchase <= 0.5 | Samples = 3 | T: 2, F: 1 | Class: Not returning
│  │  ├─ Samples = 1 | T: 1, F: 0 | Class: Not returning
│  │  └─ Samples = 2 | T: 1, F: 1 | Class: Not returning
│  └─ Samples = 2 | T: 0, F: 2 | Class: Returning
└─ Initial purchase <= 0.5 | Samples = 3 | T: 1, F: 2 | Class: Returning
   ├─ Samples = 1 | T: 0, F: 1 | Class: Returning
   └─ Samples = 2 | T: 1, F: 1 | Class: Not returning
Tree - Examples
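A decision tree is just a cascade of threshold tests, so a small one can be written as nested ifs. The function below is a hand-written stand-in inspired by the retention example: the feature names, thresholds, and leaf labels are hypothetical, not a faithful transcription of the slide's tree.

```python
def predict_retention(satisfied, customer_age, initial_purchase):
    """Classify one customer as 'returning' or 'not returning'.

    Features are assumed to be normalized to the 0..1 range; each
    internal node tests one feature against the threshold 0.5."""
    if satisfied <= 0.5:
        if customer_age <= 0.5:
            return "not returning"
        return "returning"
    if initial_purchase <= 0.5:
        return "not returning"
    return "returning"

print(predict_retention(satisfied=0.9, customer_age=0.3, initial_purchase=0.8))
# returning
```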
Naïve Bayes
Logistic sigmoid: σ(t) = 1 / (1 + e^(−t))
Bayes' theorem, used by naïve Bayes classifiers for class probability estimation:
p(y|x) = p(x|y) · p(y) / p(x)
Where:
• y is the observed classification.
• x is the vector of dataset features.
• p(y|x) is the likelihood of y given x (posterior probability).
• p(x|y) is the likelihood of x given y.
• p(y) is the probability of y independent of the data (prior probability).
• p(x) is the probability of x independent of the data.
Naïve Bayes Samples
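Bayes' theorem worked on made-up numbers: the posterior probability that a customer churns (y) given that they filed a complaint (x).

```python
# Hypothetical probabilities for illustration.
p_y = 0.2            # prior: p(churn)
p_x_given_y = 0.6    # likelihood: p(complaint | churn)
p_x = 0.3            # evidence: p(complaint)

# Bayes' theorem: p(y|x) = p(x|y) * p(y) / p(x)
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 3))   # 0.4 -- the posterior p(churn | complaint)
```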
k-Nearest Neighbor (k-NN)
(Plot of max depth vs. surface area: with k = 3, the example point is assigned to Class 0, which wins the vote among its three nearest neighbors.)
K-NN Samples
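A minimal k-NN sketch mirroring the slide's k = 3 majority vote. The training points (using two hypothetical features, max depth and surface area) are made up.

```python
from collections import Counter
from math import dist

# Hypothetical labeled points: (max_depth, surface_area) -> class.
training = [
    ((0.2, 0.3), 0), ((0.3, 0.2), 0), ((0.25, 0.4), 0),
    ((0.8, 0.7), 1), ((0.7, 0.9), 1), ((0.9, 0.8), 1),
]

def knn_predict(query, k=3):
    # Sort by Euclidean distance and let the k nearest neighbors vote.
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((0.3, 0.3)))   # 0 -- Class 0 wins the vote
print(knn_predict((0.8, 0.8)))   # 1
```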
Association
Association Sample
Deep Learning
Confusion Matrix

              Estimated: No     Estimated: Yes
Actual: No    True negatives    False positives
Actual: Yes   False negatives   True positives

Device-failure example:

                       Estimated: didn't fail   Estimated: failed
Actual: didn't fail    513                      8
Actual: failed         4                        17
Accuracy
• Drive failure model accuracy: ~98%.
• Intuitive, but often unreliable.
• You can have high accuracy even if the model doesn't excel at its purpose.
• Only really suitable for balanced datasets.
Accuracy = correct estimations / all estimations = (TP + TN) / (TP + TN + FP + FN)
Precision
• Drive failure model precision: 68%.
• More useful than accuracy in unbalanced datasets.
• Doesn't account for false negatives.
• Just one drive malfunctioning is undesirable.
• You could set your tolerance for false negatives higher, but precision will still come up short.
Precision = correct positive estimations / all positive estimations = TP / (TP + FP)
Recall
• Drive failure model recall: 81%.
• Minimizes false negatives.
• You could predict all drives will fail, making recall 100%, but the model would be useless.
• Not as good as precision at minimizing false positives.
Recall = correct positive estimations / all relevant instances = TP / (TP + FN)
Precision–Recall Tradeoff
(Curve, axes 0.0–1.0: one end is high precision with low recall, the other low precision with high recall.)
F₁ Score
• Precision and recall are more useful in unbalanced datasets.
• They come with a tradeoff.
• It's not always clear which metric is more useful.
• A false positive may be just as undesirable as a false negative.
• The F₁ score helps you find an optimal combination of precision and recall.
F₁ = 2 · (precision · recall) / (precision + recall)
E.g., with precision 0.87 and recall 0.79: F₁ = 2 · (0.87 · 0.79) / (0.87 + 0.79) ≈ 0.83.
Specificity
• Drive failure model specificity: 98%.
• Maximizes true negatives.
• Not useful in all cases, especially with imbalanced datasets.
• Customer attrition scenario based on satisfaction with a new product:
• Satisfaction is positive, lack of satisfaction is negative.
• Responses are balanced.
• Maximize true negatives to reduce attrition from unsatisfied customers.
• Might be a good case for specificity.
Specificity = correct negative estimations / all actual negatives = TN / (TN + FP)
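All of these metrics can be computed directly from the drive-failure confusion matrix (TN = 513, FP = 8, FN = 4, TP = 17), reproducing the figures quoted on the slides.

```python
# Drive-failure confusion matrix from the earlier slide.
TN, FP, FN, TP = 513, 8, 4, 17

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2))     # 0.98
print(round(precision, 2))    # 0.68
print(round(recall, 2))       # 0.81
print(round(specificity, 2))  # 0.98
print(round(f1, 2))           # 0.74
```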
Receiver Operating Characteristic (ROC) Curve
(Plot, axes 0.0–1.0: True Positive Rate vs. False Positive Rate for Model A and Model B, compared against the random-guess diagonal.)
False Positive Rate = FP / (FP + TN)
Finalize
Know Your Audience
• Your audience might be:
• Just you, no reporting needed
• A single person
• A small group of stakeholders
• An entire organization
• You may need to adjust your reporting for:
• Different knowledge
• Different needs
• Different expectations
• Findings must be translated into business insights to demonstrate their value.
• Begin by reviewing the overall process and results.
• Ask yourself:
• What did I know before the project?
• What do I know now?
• How does my analysis supplement my knowledge?
• How do the models I built address issues?
• How do the results align with KPIs?
• What business actions can be taken?
• How can I improve the data science process in the future?
• Ensure insights are both relevant and in context.
• E.g., customers care less about insights into increasing profits than insights into improving the user
experience.
• Ensure insights are clear and precise.
• E.g., a classifier "is 95% accurate and will save 20 work hours in a week as compared to current manual
review."
Derive Insights from Findings
• Explainability/interpretability is one factor that drives your conclusions.
• An explainable process is one whose inner workings are identifiable and
can be communicated.
• Often, you must be able to explain why a model produced a result.
• Proves model's skill.
• Makes decisions more defensible.
• Allays concerns people have about automation.
• Some algorithms are "black boxes" and can't be easily interpreted.
• E.g., neural networks.
• Many algorithms are explainable, however.
• There are several ways to explain them.
Explainability
Web App
Consultant
Maturity Level – Big Data
Maturity Level – DAMA-DMBOK
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 

Explorasi Data untuk Peluang Bisnis dan Pengembangan Karir.pptx (Data Exploration for Business Opportunities and Career Development)

  • 2. Dr. Windu Gata, M.Kom [timeline 1991–2022: SDN 5 Pagi Pondok Pinang; SMP 31 PGRI; SMP 178 Rempoa; SMA 86] Internal Trainer Multimatics-Karya Talents 2023 • IT Consultant since 1995 • Lecturer since 2003 • Computer Researcher since 2008
  • 3. Water gives life • Water is abundant • Water is purified • Water is distributed • Water is democratic • Water is fresh • Water is human https://medium.com/citizenme/data-is-the-new-water-seven-reasons-why-45511bc5b9bd Water cycle - Wikipedia
  • 4. Companies cannot survive without data Companies can drown in too much data You can be surrounded by data that you can’t use Data flows everywhere Data gets dirty and stale if left unattended Expensive data may not be better data Packaging matters Data management is a long term project Data quality should be fit for purpose Clean data at the source Ten Ways Data is Like Water | How to Leverage Data
  • 5. Business Driver 1) The digitization of society; 2) The plummeting of technology costs; 3) Connectivity through cloud computing; 4) Increased knowledge about data science; 5) Social media applications; 6) The upcoming Internet-of-Things (IoT).
  • 6. The digitization of society • Big Data is largely consumer driven and consumer oriented. • Most of the data in the world is generated by consumers, who are nowadays 'always-on'. • Most people now spend 4–6 hours per day consuming and generating data through a variety of devices and (social) applications. • With every click, swipe, or message, new data is created in a database somewhere around the world. Because everyone now has a smartphone in their pocket, data creation adds up to incomprehensible amounts. • Some studies estimate that 60% of all data was generated within the last two years, which is a good indication of the rate at which society has digitized.
  • 7. The plummeting of technology costs • The costs of data storage and processors keep declining, making it possible for small businesses and individuals to become involved with Big Data. • For storage capacity, the often-cited Moore's Law still holds: storage density (and therefore capacity) doubles roughly every two years.
  • 8. Connectivity through cloud computing • Cloud computing environments (where data is remotely stored in distributed storage systems) have made it possible to quickly scale IT infrastructure up or down and facilitate a pay-as-you-go model. • This means that organizations that want to process massive quantities of data (and thus have large storage and processing requirements) do not have to invest in large quantities of IT infrastructure. • Instead, they can license the storage and processing capacity they need and pay only for the amounts they actually use.
  • 9. Increased knowledge about data science • In the last decade, the terms data science and data scientist have become tremendously popular. In October 2012, Harvard Business Review called the data scientist the "sexiest job of the 21st century," and many other publications have featured this new job role in recent years. • The demand for data scientists (and similar job titles) has increased tremendously, and many people have become actively engaged in the domain of data science.
  • 11. Social media applications • Social media data provides insights into the behaviors, preferences, and opinions of 'the public' on a scale that has never been known before. Because of this, it is immensely valuable to anyone who is able to derive meaning from such large quantities of data. • Social media data can be used to identify customer preferences for product development, target new customers for future purchases, or even target potential voters in elections.
  • 12. The upcoming Internet-of-Things (IoT). • The Internet of things (IoT) is the network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and network connectivity which enables these objects to connect and exchange data
  • 13. Data is valuable • Data is an asset with unique properties. • The value of data can and should be expressed in economic terms. DAMA-DMBOK
  • 14. Data and Information • The Strategic Alignment Model (Henderson and Venkatraman, 1999) abstracts the fundamental drivers for any approach to data management. • At its center is the relationship between data and information. • Information is most often associated with business strategy and the operational use of data
  • 15. Data Governance and Data Management • Data Governance (DG) is defined as the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets. • All organizations make decisions about data, regardless of whether they have a formal Data Governance function. • The Data Governance function guides all other data management functions. • The purpose of Data Governance is to ensure that data is managed properly, according to policies and best practices (Ladley, 2012). Data management requirements are business requirements: • Managing data means managing the quality of data. • It takes metadata to manage data. • It takes planning to manage data. • Data management requirements must drive information technology decisions. Data management depends on diverse skills: • Data management is cross-functional. • Data management requires an enterprise perspective. • Data management must account for a range of perspectives. Data management is lifecycle management: • Different types of data have different lifecycle characteristics. • Managing data includes managing the risks associated with data.
  • 16.
  • 19. Database administrator A database administrator implements and manages the operational aspects of cloud-native and hybrid data platform solutions that are built on Database Server
  • 20.
  • 21. Business Analyst While some similarities exist between a data analyst and business analyst, the key differentiator between the two roles is what they do with data. 1. A business analyst is closer to the business and is a specialist in interpreting the data that comes from the visualization. 2. Often, the roles of data analyst and business analyst could be the responsibility of a single person.
  • 22. Data analyst A data analyst enables businesses to maximize the value of their data assets through visualization and reporting tools such as Microsoft Power BI Data analysts work with data engineers to determine and locate appropriate data sources that meet stakeholder requirements
  • 23.
  • 25. Data engineer 1. Data engineers provision and set up data platform technologies that are on-premises and in the cloud. 2. They manage and secure the flow of structured and unstructured data from multiple sources. 3. The data platforms that they use can include relational databases, nonrelational databases, data streams, and file stores. 4. Data engineers also ensure that data services securely and seamlessly integrate across data platforms.
  • 27. Data scientist 1. Data scientists perform advanced analytics to extract value from data. 2. Their work can vary from descriptive analytics to predictive analytics. 3. Descriptive analytics evaluate data through a process known as exploratory data analysis (EDA). 4. Predictive analytics are used in machine learning to apply modeling techniques that can detect anomalies or patterns. 5. These analytics are important parts of forecast models.
  • 29. Machine Learning and Data Science [diagram showing the relationship between Artificial Intelligence, Machine Learning, Deep Learning, and Data Science]
  • 30. The Data Science Process 30 Frame the Problem Identify & Collect Data Process Data Analyze Data Train Models Finalize the Project
  • 31. 31 • "Understandable": can be defined in terms of business needs. • "Actionable" can offer high-level direction as to how to approach a solution. • Frame the problem. • Description of problem, written clearly so it can be handed off to others. • Identify why the problem must be solved. • Rationale • Benefits • Lifetime and use • Provide background information. • Assumptions (e.g., acceptable data, operating requirements, business contexts, etc.). • Reference problems (i.e., similar problems you've solved before). • Determine whether the problem is appropriate for data science. • Some problems are more easily solved using traditional methods. • Data science can be difficult and expensive. • Data science must be justified as the optimal approach. Problem Formulation Problem formulation: The process of identifying an issue that should be addressed, and putting that issue in terms that are understandable and actionable.
  • 32. Identify & Collect 32 Frame the Problem Identify & Collect Data Process Data Analyze Data Train Models Finalize the Project
  • 33. • Data does not always start out in a neatly packaged form. • Individual pieces might span multiple repositories. • Data might be mixed in with irrelevant or dissimilar data. • You'll need to place data into one or more sets. • Example: a data repository with information about salespeople. • Data describing the same ideas is already in one place. • May be considered a dataset. • Example: Database A has customer demographics, B has actual transaction info. • Data is spread out and takes different forms. • Will be difficult to work with as is. • Must be placed into one or more sets. • Datasets can include any kind of data that's relevant to your goals. • Might be unique to your industry/organization. Datasets Dataset: A collection of data that can be used to accomplish business goals.
  • 34. 34 • Structured: • Facilitates searching, filtering, or extracting data. • E.g., spreadsheet or database. • Chunks of data can be retrieved using a programming or querying language. • Unstructured: • Not easy to query. • E.g., images, video, textual contents, etc. • Usually a larger proportion of data than structured data. • Semi-structured: • Aspects of both structured and unstructured. • E.g., email content is unstructured, but email fields are structured. • Some formats (like XML and JSON documents) can be in different forms. • Server log output could be structured. • Human-authored documents may not be structured. Structure of Data
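The structured/semi-structured distinction on the slide can be made concrete. This sketch (with hypothetical field names) uses Python's standard json module to pull the structured parts of mixed records into tabular rows, defaulting where the free-form part is absent:

```python
import json

# Hypothetical semi-structured records: "meta" is free-form and optional,
# while "id" and "amount" are structured and queryable.
raw = '''[
  {"id": 1, "amount": 9.99, "meta": {"note": "gift order"}},
  {"id": 2, "amount": 24.50}
]'''

records = json.loads(raw)

# Flatten into a structured, tabular form: one tuple per record,
# with a default for the missing unstructured part.
rows = [(r["id"], r["amount"], r.get("meta", {}).get("note", "")) for r in records]
print(rows)  # [(1, 9.99, 'gift order'), (2, 24.5, '')]
```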
  • 35. Process Data 35 Frame the Problem Identify & Collect Data Process Data Analyze Data Train Models Finalize the Project
  • 36. 36 Preliminary Data Transformation • Transformation comes after extraction in ETL. • Involves changing data in some way. • You can make changes early. • You know what to look for. • Your tools can help you find issues. • However, you won't be able to make some changes until after analysis. • Tells you what to transform and how. • Changes at this point are preliminary. • You can still get quite a bit done.
  • 37. 37 • Data preparation alters data so it more effectively supports data science tasks. • Tasks like analysis and modeling. • Necessary for achieving business goals. • Comprises multiple individual tasks. • Purpose is to identify issues before data is loaded into its destination. • Issues can be at a macro level or micro level. • Data cleaning addresses inaccuracies and other problems with data. • Subset of preparation. • Duplicated data, poorly formatted data, corrupt data, etc. • You can correct data or remove it. • Choice of action depends on feasibility and impact on later processes. • Data wrangling/munging are alternative terms. • Often refers to manual work. • Preparation can be automated so cleanup can repeat. Data Preparation and Cleaning
  • 38. Analyze 38 Frame the Problem Identify & Collect Data Process Data Analyze Data Train Models Finalize the Project
  • 39. 39 • Purpose is to maximize insights gleaned from the data. • Objectives: • Test and evaluate prior assumptions. • Reveal underlying data structure. • Determine important features/factors. • Identify unwanted elements. • Determine best path forward. • EDA is flexible and often employs visualizations. • Enables more open-ended exploration. • Incorporates plots of raw data and summarized data. • Arranging multiple plots can make it easier to recognize patterns. • EDA is valuable at every step of the process. • Changing the data or applying it could prompt EDA. Exploratory Data Analysis Exploratory data analysis: A data science approach to closely examining data in order to reveal new information.
  • 40. 40 • Start by getting familiar with content and format of data. • Try to identify: • Number of columns • Names of columns • Data types of columns • Number of rows • Primary row identifiers • Value representation • Presence/number of missing values • Use Python DataFrame functions: • info() to get attributes. • head() to get first few rows. Dataset Content and Format
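A minimal illustration of that first look at a dataset, using a made-up DataFrame (the column names and values are invented for the example):

```python
import pandas as pd

# Toy dataset standing in for real project data.
df = pd.DataFrame({
    "height_in": [62, 67, 70, None],
    "weight_lb": [120, 150, 180, 165],
    "city": ["Austin", "Boston", "Chicago", "Denver"],
})

df.info()               # column names, dtypes, non-null counts
print(df.head(3))       # first few rows
print(df.shape)         # (number of rows, number of columns)
print(df.isna().sum())  # missing values per column
```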
  • 41. Correlation Coefficient • Pearson correlation coefficient: a measurement of the linear correlation between two variables, commonly called x and y. • Produces a value between +1 (positive correlation) and −1 (negative correlation). • Shows the strength of the variables' dependence on each other. [three scatter plots: positive correlation (r = 0.6), no correlation (r = 0), negative correlation (r = −0.8)]
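The coefficient can be computed directly from its textbook definition — the covariance of x and y divided by the product of their standard deviations. A small self-contained sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # perfectly positive: 1.0
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))  # perfectly negative: -1.0
```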
  • 42. Correlation Strength [four scatter plots: strong positive correlation, strong negative correlation, weak positive correlation, weak negative correlation]
  • 43. Frequency Distribution (Fruit Example) [bar chart: frequency counts (0–100) for Apples, Bananas, Grapes, Oranges, Pears]
  • 44. Frequency Distribution (Height Example) [histogram: frequency counts (0–100) by height in inches]
  • 45. Probability Distribution (Fruit Example) [bar chart: probabilities (0–0.4) for Apples, Bananas, Grapes, Oranges, Pears]
  • 46. Probability Distribution (Height Example) [histogram: probabilities (0–0.2) by height in inches]
  • 47. Normal Distribution • Bell shaped • Symmetrical • Centered • Unimodal [curve: frequency or probability against variable values]
  • 48. Normal Distribution (Height Example) [curve: probabilities (0–0.2) by height in inches]
  • 50. Standard Deviation Comparison [two normal curves with mean 67″: one with standard deviation 5, one with standard deviation 20]
  • 51. Standard Deviations in a Normal Distribution (Height Example) [curve with mean 67″ and standard deviation 10″: 68% of values fall within ±1σ (57″–77″), 95% within ±2σ (47″–87″), and 99.7% within ±3σ (37″–97″)]
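The 68/95/99.7 percentages are not specific to heights; they follow from the normal distribution itself, since the fraction of values within k standard deviations of the mean is erf(k/√2). A quick numerical check:

```python
import math

# Fraction of a normal distribution within k standard deviations of the mean:
# P(|X - mu| <= k * sigma) = erf(k / sqrt(2))
for k in (1, 2, 3):
    frac = math.erf(k / math.sqrt(2))
    print(f"within ±{k}σ: {frac:.1%}")  # 68.3%, 95.4%, 99.7%
```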
  • 52. Skewness [three curves: symmetrical (mean = median = mode), positive skew (mode < median < mean), negative skew (mean < median < mode)]
  • 58. Heatmaps [left: heatmap superimposed on a geographical map, with areas with fewer houses shown in purple and areas with more houses shown in green; right: correlation matrix shown as a heatmap, with lower-correlation pairs in lighter shades and higher-correlation pairs in darker shades]
  • 59. You Are Here (Process, Analyze, Train Models) 59 Frame the Problem Identify & Collect Data Process Data Analyze Data Train Models Finalize the Project
  • 61. The Bias–Variance Tradeoff • High bias: may underfit the training set; more simplistic; less likely to be influenced by true relationships between features and target outputs. • The sweet spot: a good-enough fit on the training data; just complex enough; skillful at finding true relationships between features and target outputs while not being overly influenced by noise. • High variance: may overfit the training set; more complex; more likely to be influenced by false relationships between features and target outputs ("noise"). [curve: error against model complexity, with the sweet spot at the minimum]
  • 62. Holdout Method [diagram: the original data is split into a training set, a validation set, and a test set (the holdout sets); the algorithm learns from the training set to produce the predictive model]
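A minimal sketch of the holdout split, assuming a 60/20/20 partition (the slide does not prescribe exact proportions) and a fixed seed for reproducibility:

```python
import random

def holdout_split(data, train=0.6, val=0.2, seed=42):
    """Shuffle the data, then cut it into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = holdout_split(list(range(100)))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```

Because the validation set is reserved for tuning and the test set is touched only once at the end, the test score remains an honest estimate of generalization.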
  • 63. k-Means Clustering (Slide 1 of 2) [scatter plot on a 0.0–1.0 axis range with the cluster centroids marked]
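A bare-bones version of Lloyd's iteration — the standard algorithm behind the centroid picture — on a toy 2-D dataset (the points and initial centroids are invented for the example):

```python
def kmeans(points, centroids, iters=10):
    """Plain Lloyd's algorithm in 2-D: assign each point to its nearest
    centroid, then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

# Two well-separated blobs; the centroids converge to the blob means.
pts = [(0.1, 0.1), (0.2, 0.0), (0.0, 0.2), (0.9, 0.9), (1.0, 0.8), (0.8, 1.0)]
print(kmeans(pts, centroids=[(0.0, 0.0), (1.0, 1.0)]))
```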
  • 67. Support-Vector Machines (SVMs) [scatter plot of Class 0 and Class 1 points: the decision boundary separates the classes, flanked by the support-vector margins, which pass through the support vectors]
  • 69. Customer Retention Example Tree [decision tree over 8 samples: the root splits on Satisfied <= 0.5, with further splits on age <= 0.5 and Initial purchase <= 0.5; each node reports its sample counts (T/F) and a class of Returning or Not returning]
  • 71. Naïve Bayes Bayes' theorem—used by naïve Bayes classifiers for class probability estimation: p(y|x) = p(x|y) p(y) / p(x). Where: • y is the observed classification. • x is the vector of dataset features. • p(y|x) is the likelihood of y given x (posterior probability). • p(x|y) is the likelihood of x given y. • p(y) is the probability of y independent of the data (prior probability). • p(x) is the probability of x independent of the data.
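Plugging numbers into the theorem shows the mechanics. The probabilities below are invented for a toy spam-filter example (they do not come from the slides):

```python
# Hypothetical spam-filter numbers for Bayes' theorem:
p_spam = 0.2             # prior p(y): fraction of mail that is spam
p_word_given_spam = 0.6  # p(x|y): the word "free" appears in spam
p_word_given_ham = 0.05  # p(x|not-y): "free" appears in legitimate mail

# Evidence p(x), via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior p(y|x) = p(x|y) * p(y) / p(x)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 2))  # 0.75
```

Even though only 20% of mail is spam, seeing the word raises the spam probability to 75%, because the word is twelve times more likely in spam than in legitimate mail.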
  • 73. k-Nearest Neighbor (k-NN) [scatter plot of Class 0 and Class 1 points over Max Depth and Surface Area features: with k = 3, the example point's three nearest neighbors vote, and Class 0 wins]
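A from-scratch sketch of the k = 3 vote; the training points and feature values are invented for the example:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points. `train` is a list of ((x, y), label) pairs."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature space (e.g., max depth vs. surface area).
train = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
         ((0.9, 0.8), 1), ((0.8, 0.9), 1)]
print(knn_predict(train, (0.25, 0.2), k=3))  # 0
```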
  • 79. Accuracy • Drive failure model accuracy: ~98%. • Intuitive, but often unreliable. • You can have high accuracy even if the model doesn't excel at its purpose. • Only really suitable for balanced datasets. Accuracy = correct estimations / all estimations = (TP + TN) / (TP + TN + FP + FN).
  • 80. Precision • Drive failure model precision: 68%. • More useful than accuracy on unbalanced datasets. • Doesn't account for false negatives. • Just one drive malfunctioning is undesirable. • You could set your tolerance for false negatives higher, but precision will still come up short. Precision = correct positive estimations / all positive estimations = TP / (TP + FP).
  • 81. Recall • Drive failure model recall: 81%. • Minimizes false negatives. • You could predict that all drives will fail, making recall 100%, but the model would be useless. • Not as good as precision at minimizing false positives. Recall = correct positive estimations / all relevant instances = TP / (TP + FN).
  • 83. F₁ Score • Precision and recall are more useful on unbalanced datasets, but they come with a tradeoff. • It is not always clear which metric is more useful: a false positive may be just as undesirable as a false negative. • The F₁ score helps you find the optimal combination of precision and recall: F₁ = 2 · (precision · recall) / (precision + recall). • With precision 0.87 and recall 0.79, F₁ = 2 · (0.87 · 0.79) / (0.87 + 0.79) ≈ 0.83.
  • 84. Specificity • Drive failure model specificity: 98%. • Maximizes true negatives. • Not useful in all cases, especially with imbalanced datasets. • Customer attrition scenario based on satisfaction with a new product: satisfaction is positive, lack of satisfaction is negative; responses are balanced; maximize true negatives to reduce attrition from unsatisfied customers. Might be a good case for specificity. Specificity = correct negative estimations / all actual negatives = TN / (TN + FP).
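All five metrics above are ratios of the same four confusion-matrix counts. A sketch with hypothetical counts (not the drive-failure model's actual numbers):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # all correct / all
precision = TP / (TP + FP)                   # correct positives / predicted positives
recall = TP / (TP + FN)                      # correct positives / actual positives
specificity = TN / (TN + FP)                 # correct negatives / actual negatives
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f} f1={f1:.2f}")
```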
  • 85. Receiver Operating Characteristic (ROC) Curve [plot of true positive rate against false positive rate (FP / (FP + TN)): curves for Model A and Model B compared against the random-guess diagonal]
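Each point on an ROC curve is the (false positive rate, true positive rate) pair obtained at one decision threshold. A small sketch that sweeps thresholds over hypothetical model scores:

```python
def roc_points(scores, labels):
    """Sweep the decision threshold over every distinct score and return
    (false-positive-rate, true-positive-rate) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical predicted scores with their true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0]
print(roc_points(scores, labels))
```

A curve that hugs the top-left corner (high TPR at low FPR) indicates a stronger model than the diagonal, which is what random guessing produces.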
  • 86. Finalize 86 Frame the Problem Identify & Collect Data Process Data Analyze Data Train Models Finalize the Project
  • 87. Know Your Audience 87 • Your audience might be: • Just you, no reporting needed • A single person • A small group of stakeholders • An entire organization • You may need to adjust your reporting for: • Different knowledge • Different needs • Different expectations
  • 88. 88 • Findings must be translated into business insights to demonstrate their value. • Begin by reviewing the overall process and results. • Ask yourself: • What did I know before the project? • What do I know now? • How does my analysis supplement my knowledge? • How do the models I built address issues? • How do the results align with KPIs? • What business actions can be taken? • How can I improve the data science process in the future? • Ensure insights are both relevant and in context. • E.g., customers care less about insights into increasing profits than insights into improving the user experience. • Ensure insights are clear and precise. • E.g., a classifier "is 95% accurate and will save 20 work hours in a week as compared to current manual review." Derive Insights from Findings
  • 89. 89 • Explainability/interpretability is one factor that drives your conclusions. • An explainable process is one whose inner workings are identifiable and can be communicated. • Often, you must be able to explain why a model produced a result. • Proves model's skill. • Makes decisions more defensible. • Allays concerns people have for automation. • Some algorithms are "black boxes" and can't be easily interpreted. • E.g., neural networks. • Many algorithms are explainable, however. • There are several ways to explain them. Explainability
  • 92. Maturity Level – Big Data
  • 93. Maturity Level – DAMA-DMBOK