Data Exploration
Dr. Windu Gata, M.Kom
Education: SDN 5 Pagi Pondok Pinang; SMP 31 PGRI; SMP 178 Rempoa; SMA 86
[Timeline figure: career and education milestones, 1991–2022]
Internal Trainer, Multimatics-Karya Talents (2023) · IT Consultant since 1995 · Lecturer since 2003 · Computer Researcher since 2008
• Water gives life
• Water is abundant
• Water is purified
• Water is distributed
• Water is democratic
• Water is fresh
• Water is human
https://medium.com/citizenme/data-is-the-new-water-seven-reasons-why-45511bc5b9bd
Water cycle - Wikipedia
Ten Ways Data is Like Water | How to Leverage Data
• Companies cannot survive without data
• Companies can drown in too much data
• You can be surrounded by data that you can't use
• Data flows everywhere
• Data gets dirty and stale if left unattended
• Expensive data may not be better data
• Packaging matters
• Data management is a long-term project
• Data quality should be fit for purpose
• Clean data at the source
Business Driver
1) The digitization of society;
2) The plummeting of technology costs;
3) Connectivity through cloud computing;
4) Increased knowledge about data science;
5) Social media applications;
6) The upcoming Internet-of-Things (IoT).
The digitization of society
• Big Data is largely consumer driven and consumer oriented.
• Most of the data in the world is generated by consumers, who are nowadays 'always-on'.
• Most people now spend 4-6 hours per day consuming and generating data through a variety of devices and (social) applications.
• With every click, swipe or message, new data is created in a database somewhere around the world. Because everyone now has a smartphone in their pocket, the data creation sums to incomprehensible amounts.
• Some studies estimate that 60% of data was generated within the last two years, which is a good indication of the rate at which society has digitized.
The plummeting of technology costs
• The costs of data storage and processors keep declining, making it possible for small businesses and individuals to become involved with Big Data.
• For storage capacity, the often-cited Moore's Law still holds: storage density (and therefore capacity) doubles every two years.
Connectivity through cloud computing
• Cloud computing environments (where data is remotely stored in distributed storage systems) have made it possible to quickly scale IT infrastructure up or down and facilitate a pay-as-you-go model.
• This means that organizations that want to process massive quantities of data (and thus have large storage and processing requirements) do not have to invest in large quantities of IT infrastructure.
• Instead, they can license the storage and processing capacity they need and only pay for the amounts they actually use.
Increased knowledge about data science
• In the last decade, the terms data science and data scientist have become tremendously popular. In October 2012, Harvard Business Review called data scientist the “sexiest job of the 21st century”, and many other publications have featured this new job role in recent years.
• The demand for data scientists (and similar job titles) has increased tremendously, and many people have become actively engaged in the domain of data science.
Interest by Region
Social media applications
Social media data provides insights into the behaviors, preferences and opinions of 'the public' on a scale that has never been known before.
Because of this, it is immensely valuable to anyone who is able to derive meaning from these large quantities of data.
Social media data can be used to identify customer preferences for product development, target new customers for future purchases, or even target potential voters in elections.
The upcoming Internet-of-Things (IoT).
• The Internet of Things (IoT) is the network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and network connectivity, which enables these objects to connect and exchange data.
Data is valuable
• Data is an asset with unique properties
• The value of data can and should be expressed in economic terms
(DAMA-DMBOK)
Data and Information
• The Strategic Alignment Model (Henderson and Venkatraman, 1999) abstracts the fundamental drivers for any approach to data management.
• At its center is the relationship between data and information.
• Information is most often associated with business strategy and the operational use of data.
Data Governance and Data Management
• Data Governance (DG) is defined as the exercise of authority and control
(planning, monitoring, and enforcement) over the management of data
assets.
• All organizations make decisions about data, regardless of whether they have
a formal Data Governance function.
• The Data Governance function guides all other data management functions.
• The purpose of Data Governance is to ensure that data is managed properly,
according to policies and best practices (Ladley, 2012)
Data Management Requirements are Business Requirements
• Managing data means managing the quality of data
• It takes Metadata to manage data
• It takes planning to manage data
• Data Management requirements must drive Information Technology decisions
Data Management depends on diverse skills
• Data Management is cross-functional
• Data management requires an enterprise perspective
• Data management must account for a range of perspectives
Data Management is lifecycle management
• Different types of data have different lifecycle characteristics
• Managing data includes managing the risks associated with data
Roles (Career)
Database administrator
A database administrator implements and manages the operational aspects of cloud-native and hybrid data platform solutions that are built on a database server.
Business Analyst
While some similarities exist between a data analyst and a business analyst, the key differentiator between the two roles is what they do with data.
1. A business analyst is closer to the business and is a specialist in interpreting the data that comes from the visualization.
2. Often, the roles of data analyst and business analyst could be the responsibility of a single person.
Data analyst
A data analyst enables businesses to maximize the value of their data assets through visualization and reporting tools such as Microsoft Power BI.
Data analysts work with data engineers to determine and locate appropriate data sources that meet stakeholder requirements.
Data Visualization
Data engineer
1. Data engineers provision and set up data platform technologies that are on-premises and in the cloud.
2. They manage and secure the flow of structured and unstructured data from multiple sources.
3. The data platforms that they use can include relational databases, nonrelational databases, data streams, and file stores.
4. Data engineers also ensure that data services securely and seamlessly integrate across data platforms.
Data Engineer Skill
Data scientist
1. Data scientists perform advanced analytics to extract value from data.
2. Their work can vary from descriptive analytics to predictive analytics.
3. Descriptive analytics evaluate data through a process known as exploratory data analysis (EDA).
4. Predictive analytics are used in machine learning to apply modeling techniques that can detect anomalies or patterns.
5. These analytics are important parts of forecast models.
Artificial Intelligence
Machine Learning and Data Science
[Venn diagram: Deep Learning sits inside Machine Learning, which sits inside Artificial Intelligence; Data Science overlaps with all three]
The Data Science Process
Frame the Problem → Identify & Collect Data → Process Data → Analyze Data → Train Models → Finalize the Project
Problem Formulation
Problem formulation: The process of identifying an issue that should be addressed, and putting that issue in terms that are understandable and actionable.
• "Understandable": can be defined in terms of business needs.
• "Actionable": can offer high-level direction as to how to approach a solution.
• Frame the problem.
• Description of the problem, written clearly so it can be handed off to others.
• Identify why the problem must be solved.
• Rationale
• Benefits
• Lifetime and use
• Provide background information.
• Assumptions (e.g., acceptable data, operating requirements, business contexts, etc.).
• Reference problems (i.e., similar problems you've solved before).
• Determine whether the problem is appropriate for data science.
• Some problems are more easily solved using traditional methods.
• Data science can be difficult and expensive.
• Data science must be justified as the optimal approach.
Identify & Collect Data (data science process, step 2 of 6)
Datasets
Dataset: A collection of data that can be used to accomplish business goals.
• Data does not always start out in a neatly packaged form.
• Individual pieces might span multiple repositories.
• Data might be mixed in with irrelevant or dissimilar data.
• You'll need to place data into one or more sets.
• Example: a data repository with information about salespeople.
• Data describing the same ideas is already in one place.
• May be considered a dataset.
• Example: database A has customer demographics, database B has actual transaction info.
• Data is spread out and takes different forms.
• Will be difficult to work with as is.
• Must be placed into one or more sets.
• Datasets can include any kind of data that's relevant to your goals.
• Might be unique to your industry/organization.
Structure of Data
• Structured:
• Facilitates searching, filtering, or extracting data.
• E.g., spreadsheet or database.
• Chunks of data can be retrieved using a programming or querying language.
• Unstructured:
• Not easy to query.
• E.g., images, video, textual contents, etc.
• Usually a larger proportion of data than structured data.
• Semi-structured:
• Aspects of both structured and unstructured.
• E.g., email content is unstructured, but email fields are structured.
• Some formats (like XML and JSON documents) can be in different forms.
• Server log output could be structured.
• Human-authored documents may not be structured.
Process Data (data science process, step 3 of 6)
Preliminary Data Transformation
• Transformation comes after extraction in ETL.
• Involves changing data in some way.
• You can make changes early.
• You know what to look for.
• Your tools can help you find issues.
• However, you won't be able to make some changes until after analysis.
• Tells you what to transform and how.
• Changes at this point are preliminary.
• You can still get quite a bit done.
Data Preparation and Cleaning
• Data preparation alters data so it more effectively supports data science tasks.
• Tasks like analysis and modeling.
• Necessary for achieving business goals.
• Comprises multiple individual tasks.
• Purpose is to identify issues before data is loaded into its destination.
• Issues can be at a macro level or micro level.
• Data cleaning addresses inaccuracies and other problems with data (see the sketch after this list).
• Subset of preparation.
• Duplicated data, poorly formatted data, corrupt data, etc.
• You can correct data or remove it.
• Choice of action depends on feasibility and impact on later processes.
• Data wrangling/munging are alternative terms.
• Often refers to manual work.
• Preparation can be automated so cleanup can repeat.
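As an illustration of these cleaning tasks, here is a minimal pandas sketch. The DataFrame, column names, and the specific fixes are hypothetical placeholders, not from the slides:

```python
import pandas as pd

# Hypothetical raw data with typical quality issues:
# a duplicated row, inconsistent category labels, and missing values.
raw = pd.DataFrame({
    "customer_id":   [101, 101, 102, 103, 104],
    "signup_date":   ["2023-01-05", "2023-01-05", "2023-02-05", "2023-03-11", None],
    "country":       ["ID", "ID", "indonesia", "Indonesia", "ID"],
    "monthly_spend": ["120", "120", "95", None, "210"],
})

clean = (
    raw
    .drop_duplicates()  # remove duplicated rows
    .assign(
        # normalize poorly formatted category labels
        country=lambda d: d["country"].str.strip().str.upper().replace({"INDONESIA": "ID"}),
        # coerce text columns to proper types; bad values become NaT/NaN
        signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),
        monthly_spend=lambda d: pd.to_numeric(d["monthly_spend"], errors="coerce"),
    )
)

# Decide per column whether to correct or remove problem values.
clean["monthly_spend"] = clean["monthly_spend"].fillna(clean["monthly_spend"].median())
clean = clean.dropna(subset=["signup_date"])

print(clean)
```

Because the steps are ordinary function calls, the same cleanup can be wrapped in a function and rerun whenever new raw data arrives, which is the automation point made above.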
Analyze Data (data science process, step 4 of 6)
Exploratory Data Analysis
Exploratory data analysis: A data science approach to closely examining data in order to reveal new information.
• Purpose is to maximize insights gleaned from the data.
• Objectives:
• Test and evaluate prior assumptions.
• Reveal underlying data structure.
• Determine important features/factors.
• Identify unwanted elements.
• Determine best path forward.
• EDA is flexible and often employs visualizations.
• Enables more open-ended exploration.
• Incorporates plots of raw data and summarized data.
• Arranging multiple plots can make it easier to recognize patterns.
• EDA is valuable at every step of the process.
• Changing the data or applying it could prompt EDA.
Dataset Content and Format
• Start by getting familiar with the content and format of the data.
• Try to identify:
• Number of columns
• Names of columns
• Data types of columns
• Number of rows
• Primary row identifiers
• Value representation
• Presence/number of missing values
• Use Python DataFrame functions (see the sketch after this list):
• info() to get attributes.
• head() to get the first few rows.
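A minimal sketch of that first look with pandas; the CSV path and the order_id column are hypothetical:

```python
import pandas as pd

# Hypothetical dataset; replace the path with your own file.
df = pd.read_csv("sales.csv")

df.info()                        # column names, data types, non-null counts, row count
print(df.head())                 # first few rows: how values are represented
print(df.shape)                  # (number of rows, number of columns)
print(df.isna().sum())           # number of missing values per column
print(df["order_id"].is_unique)  # check a candidate primary row identifier
```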
Correlation Coefficient
Pearson correlation coefficient: A measurement of the linear correlation between two variables, commonly called x and y.
• Produces a value between +1 (positive correlation) and −1 (negative correlation).
• Shows the strength of the two variables' dependence on each other.
[Scatter plots: positive correlation (r = 0.6), no correlation (r = 0), negative correlation (r = −0.8)]
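A short sketch of computing the Pearson correlation coefficient with NumPy, on made-up x and y values:

```python
import numpy as np

# Made-up paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.4, 4.8, 5.1, 6.3])

# Pearson r: covariance of x and y divided by the product of their standard deviations.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))   # close to +1 here, i.e. a strong positive linear relationship
```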
Correlation Strength
[Scatter plots: strong positive, strong negative, weak positive, and weak negative correlation]
Frequency Distribution (Fruit Example)
[Bar chart: frequency (0–100) of Apples, Bananas, Grapes, Oranges, Pears]
Frequency Distribution (Height Example)
[Histogram: frequency (0–100) by height in inches]
Probability Distribution (Fruit Example)
[Bar chart: probability (0–0.4) of Apples, Bananas, Grapes, Oranges, Pears]
Probability Distribution (Height Example)
[Histogram: probability (0–0.2) by height in inches]
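A sketch of how the frequency and probability views relate, using a made-up fruit column in pandas:

```python
import pandas as pd

# Made-up categorical data.
fruit = pd.Series(["Apples", "Bananas", "Grapes", "Oranges", "Pears",
                   "Apples", "Bananas", "Apples", "Grapes", "Apples"])

freq = fruit.value_counts()                # frequency distribution: raw counts per category
prob = fruit.value_counts(normalize=True)  # probability distribution: proportions that sum to 1

print(freq)
print(prob)

# For a numeric column such as heights, the same idea is a histogram:
# heights.hist(bins=10) for frequencies, heights.hist(bins=10, density=True) for probabilities.
```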
Normal Distribution
• Bell shaped
• Symmetrical
• Centered
• Unimodal
[Curve: frequency or probability against variable values]
Normal Distribution (Height Example)
[Normal curve: probability (0–0.2) by height in inches]
Non-Normal Distributions
[Examples: skewed distributions and multi-modal distributions]
Standard Deviation Comparison
[Two distributions with mean 67″: one with STDEV 5, one with STDEV 20]
Standard Deviations in a Normal Distribution (Height Example)
[Normal curve with mean 67″ and standard deviation markers from −3 to +3:
68% of values fall within ±1 standard deviation (57″–77″),
95% within ±2 standard deviations (47″–87″),
99.7% within ±3 standard deviations (37″–97″)]
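A quick numerical check of the 68-95-99.7 rule for this height example. The figure's ±1 standard deviation bounds of 57″ and 77″ around a mean of 67″ imply a standard deviation of 10″; the data below is simulated with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=67, scale=10, size=100_000)  # simulated heights in inches

for k in (1, 2, 3):
    lo, hi = 67 - k * 10, 67 + k * 10
    share = np.mean((heights >= lo) & (heights <= hi))
    print(f"within ±{k} standard deviation(s) ({lo} to {hi} inches): {share:.1%}")
# Prints roughly 68%, 95%, and 99.7%.
```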
Skewness
[Three curves comparing mean, median, and mode:
symmetrical: mean = median = mode;
positive skew: the mean is pulled toward the right tail (mode, then median, then mean);
negative skew: the mean is pulled toward the left tail (mean, then median, then mode)]
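A small sketch of measuring skewness and comparing mean, median, and mode, using made-up values with a long right tail:

```python
import pandas as pd

# Made-up, positively skewed values (long tail to the right).
values = pd.Series([1, 2, 2, 2, 3, 3, 4, 5, 9, 15])

print(values.skew())         # > 0 indicates positive skew
print(values.mean())         # 4.6, pulled toward the tail
print(values.median())       # 3.0
print(values.mode().iloc[0]) # 2, the most frequent value
```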
Box Plots
[Diagram: minimum, Q1, median, Q3, maximum, and an outlier]
Violin Plot
[Diagram: width shows probability density from low to high, with the median marked]
Line Plots
[Diagram: data points connected along a trend line]
Area Plots
[Diagram: trend areas]
Geographical Maps
[Map: most expensive homes versus less expensive homes]
Heatmaps
• Heatmap superimposed on a geographical map: areas with fewer houses are shown in purple; areas with more houses are shown in green.
• Correlation matrix shown in a heatmap: data pairs with lower correlation are shown in lighter shades; data pairs with higher correlation are shown in darker shades.
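A compact sketch of two of these visualizations, a box plot and a correlation heatmap, with pandas and Matplotlib; the DataFrame and its columns are made up:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "price": rng.normal(300_000, 75_000, 200),  # made-up house prices
    "area":  rng.normal(150, 40, 200),          # made-up floor areas
    "rooms": rng.integers(1, 6, 200),           # made-up room counts
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Box plot: median, quartiles, and outliers of one variable.
df.boxplot(column="price", ax=axes[0])
axes[0].set_title("Box plot of price")

# Correlation matrix shown as a heatmap: darker cells mean stronger correlation.
corr = df.corr()
im = axes[1].imshow(corr, cmap="viridis")
axes[1].set_xticks(range(len(corr.columns)))
axes[1].set_xticklabels(corr.columns)
axes[1].set_yticks(range(len(corr.columns)))
axes[1].set_yticklabels(corr.columns)
fig.colorbar(im, ax=axes[1])
axes[1].set_title("Correlation heatmap")

plt.tight_layout()
plt.show()
```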
You Are Here: Process Data, Analyze Data, Train Models (data science process, steps 3–5 of 6)
Machine Learning Algorithms
The Bias–Variance Tradeoff
• High bias:
• May underfit the training set
• More simplistic
• Less able to capture true relationships between features and target outputs
• The sweet spot:
• Good enough fit on training datasets
• Just complex enough
• Skillful in finding true relationships between features and target outputs while not overly influenced by noise
• High variance:
• May overfit the training set
• More complex
• More likely to be influenced by false relationships between features and target outputs ("noise")
[Plot: error against model complexity, with the sweet spot between the high-bias and high-variance regions]
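A sketch of the tradeoff using polynomial models of increasing complexity with scikit-learn. The data is synthetic; a low degree tends to underfit (high bias) and a very high degree tends to overfit (high variance):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # true signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}  train MSE {train_err:.3f}  test MSE {test_err:.3f}")

# Low degree: both errors stay high (high bias). Very high degree: train error drops
# while test error suffers (high variance). A middle degree sits near the sweet spot.
```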
Holdout Method
[Diagram: the original data is divided into holdout sets: a training set that the algorithm learns from, plus a validation set and a test set used to evaluate the predictive model]
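A minimal sketch of carving out holdout sets with scikit-learn, splitting made-up data into training, validation, and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))      # made-up features
y = rng.integers(0, 2, size=1000)   # made-up labels

# First hold out a test set, then split the remainder into training and validation sets.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25,
                                                  random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 600 / 200 / 200
```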
k-Means Clustering
[Scatter plot (both axes 0.0–1.0): samples grouped around their centroids]
K-Means Samples
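A small k-means sketch with scikit-learn on synthetic blobs; the sample sizes and parameters are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic 2-D points grouped around 3 true centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # learned centroids
print(labels[:10])               # cluster assignment of the first few points
```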
Linear Regression
[Scatter plot: dependent variable against independent variable, with a fitted regression line]
Linear Regression Sample
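A minimal linear regression sketch with scikit-learn on synthetic data, with one independent and one dependent variable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(100, 1))                        # independent variable
y = 2.0 * X.ravel() + 5.0 + rng.normal(scale=5, size=100)    # dependent variable with noise

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # close to the true slope (2.0) and intercept (5.0)
print(model.predict([[25.0]]))            # predicted value for a new observation
```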
Support-Vector Machines (SVMs)
[Scatter plot (both axes 0.0–2.0): Class 0 and Class 1 samples separated by a decision boundary; the support vectors lie on the support-vector margins on either side of the boundary]
SVM Samples
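A short SVM sketch with scikit-learn: a linear SVC separating two synthetic classes and exposing its support vectors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
print(clf.score(X_test, y_test))      # accuracy on the held-out set
print(clf.support_vectors_.shape)     # the points that define the margins
```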
Customer Retention Example Tree
[Decision tree:]
Satisfied <= 0.5 (Samples = 8, T: 3 | F: 5, Class: Returning)
├─ Customer age <= 0.5 (Samples = 5, T: 1 | F: 4, Class: Returning)
│   ├─ Leaf (Samples = 2, T: 0 | F: 2, Class: Returning)
│   └─ Initial purchase <= 0.5 (Samples = 3, T: 1 | F: 2, Class: Returning)
│       ├─ Leaf (Samples = 1, T: 0 | F: 1, Class: Returning)
│       └─ Leaf (Samples = 2, T: 1 | F: 1, Class: Not returning)
└─ Initial purchase <= 0.5 (Samples = 3, T: 2 | F: 1, Class: Not returning)
    ├─ Leaf (Samples = 1, T: 1 | F: 0, Class: Not returning)
    └─ Leaf (Samples = 2, T: 1 | F: 1, Class: Not returning)
Tree - Examples
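A sketch of fitting and printing a small decision tree with scikit-learn. The feature names mirror the example above (satisfied, customer age, initial purchase), but the tiny dataset itself is made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up binary features: [satisfied, customer_age, initial_purchase]
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 0], [0, 1, 1],
              [1, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # 1 = returning, 0 = not returning

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["satisfied", "customer_age", "initial_purchase"]))
print(tree.predict([[1, 0, 1]]))   # predicted class for a new customer
```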
Naïve Bayes
σ(t) = 1 / (1 + e^(−t))
Bayes' theorem: used by naïve Bayes classifiers for class probability estimation.
p(y|x) = p(x|y) · p(y) / p(x)
Where:
• y is the observed classification.
• x is the vector of dataset features.
• p(y|x) is the likelihood of y given x (posterior probability).
• p(x|y) is the likelihood of x given y.
• p(y) is the probability of y independent of the data (prior probability).
• p(x) is the probability of x independent of the data.
Naïve Bayes Samples
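A minimal naïve Bayes sketch with scikit-learn's GaussianNB on synthetic data; predict_proba returns the estimated posterior p(y|x) for each class:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))       # accuracy on the held-out set
print(nb.predict_proba(X_test[:3]))   # posterior probability p(y|x) per class
```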
k-Nearest Neighbor (k-NN)
[Scatter plot of surface area against max depth: Class 0 and Class 1 samples; with k = 3, Class 0 wins the vote for the example point]
K-NN Samples
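A small k-NN sketch with scikit-learn using k = 3, matching the vote shown in the figure; the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict(X_test[:1]))     # class chosen by the vote of the 3 nearest neighbors
print(knn.score(X_test, y_test))   # accuracy on the held-out set
```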
Association
Association Sample
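A small pure-Python sketch of the core association-rule measures (support, confidence, lift) on made-up market-basket transactions; dedicated libraries exist, but the arithmetic is simple enough to show directly:

```python
# Made-up market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {bread} -> {butter}
antecedent, consequent = {"bread"}, {"butter"}
supp = support(antecedent | consequent)     # how often both appear together
conf = supp / support(antecedent)           # how often butter appears given bread
lift = conf / support(consequent)           # confidence relative to butter's base rate

print(f"support={supp:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```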
Deep Learning
Confusion Matrix

                             Estimation: No       Estimation: Yes
Actual: No                   True negatives       False positives
Actual: Yes                  False negatives      True positives

                             Estimation:          Estimation:
                             device didn't fail   device failed
Actual: device didn't fail   513                  8
Actual: device failed        4                    17
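Using the counts in the device-failure matrix above, the accuracy, precision, recall, and specificity figures on the following slides can be reproduced with plain Python (the F₁ slide uses a separate precision/recall pair):

```python
# Counts from the device-failure confusion matrix above.
TN, FP, FN, TP = 513, 8, 4, 17

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.1%}, precision={precision:.1%}, recall={recall:.1%}, "
      f"specificity={specificity:.1%}, F1={f1:.1%}")
# accuracy ≈ 97.8%, precision = 68.0%, recall ≈ 81.0%, specificity ≈ 98.5%, F1 ≈ 73.9%
```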
Accuracy
• Drive failure model accuracy: ~98%.
• Intuitive, but often unreliable.
• You can have high accuracy even if the model doesn't excel at its purpose.
• Only really suitable in balanced datasets.
Accuracy = correct estimations / all estimations = (TP + TN) / (TP + TN + FP + FN)
Precision
• Drive failure model precision: 68%.
• More useful than accuracy in unbalanced datasets.
• Doesn't account for false negatives.
• Just one drive malfunctioning is undesirable.
• You could set your tolerance for false negatives higher, but precision will still come up short.
Precision = correct positive estimations / all positive estimations = TP / (TP + FP)
Recall
• Drive failure model recall: 81%.
• Minimizes false negatives.
• You could predict all drives will fail, making recall 100%, but the model would be useless.
• Not as good as precision at minimizing false positives.
Recall = correct positive estimations / all relevant instances = TP / (TP + FN)
Precision–Recall Tradeoff
[Curve of precision against recall (both 0.0–1.0): one end of the curve has high precision and low recall, the other has low precision and high recall]
F₁ Score
• Precision and recall are more useful in unbalanced datasets.
• They come with a tradeoff.
• Not always clear which metric is more useful.
• A false positive may be just as undesirable as a false negative.
• The F₁ score helps you find the optimal combination of both precision and recall.
F₁ = 2 · (precision · recall) / (precision + recall)
F₁ = 2 · (0.87 · 0.79) / (0.87 + 0.79)
• The resulting F₁ score is around 83%.
Specificity
• Drive failure model specificity: 98%.
• Maximizes true negatives.
• Not useful in all cases, especially with imbalanced datasets.
• Customer attrition scenario based on satisfaction with a new product:
• Satisfaction is positive, lack of satisfaction is negative.
• Responses are balanced.
• Maximize true negatives to reduce attrition from unsatisfied customers.
• Might be a good case for specificity.
Specificity = correct negative estimations / all actual negatives = TN / (TN + FP)
Receiver Operating Characteristic (ROC) Curve
[Plot of true positive rate against false positive rate (both 0.0–1.0), comparing Model A, Model B, and the random-guess diagonal]
False Positive Rate = FP / (FP + TN)
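A sketch of plotting a ROC curve and computing the area under it with scikit-learn on synthetic data; the logistic regression model here is just a stand-in classifier:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]    # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, scores)       # false/true positive rates per threshold
print("AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```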
Finalize the Project (data science process, step 6 of 6)
Know Your Audience
• Your audience might be:
• Just you, no reporting needed
• A single person
• A small group of stakeholders
• An entire organization
• You may need to adjust your reporting for:
• Different knowledge
• Different needs
• Different expectations
Derive Insights from Findings
• Findings must be translated into business insights to demonstrate their value.
• Begin by reviewing the overall process and results.
• Ask yourself:
• What did I know before the project?
• What do I know now?
• How does my analysis supplement my knowledge?
• How do the models I built address issues?
• How do the results align with KPIs?
• What business actions can be taken?
• How can I improve the data science process in the future?
• Ensure insights are both relevant and in context.
• E.g., customers care less about insights into increasing profits than insights into improving the user experience.
• Ensure insights are clear and precise.
• E.g., a classifier "is 95% accurate and will save 20 work hours in a week as compared to current manual review."
Explainability
• Explainability/interpretability is one factor that drives your conclusions.
• An explainable process is one whose inner workings are identifiable and can be communicated.
• Often, you must be able to explain why a model produced a result.
• Proves the model's skill.
• Makes decisions more defensible.
• Allays concerns people have about automation.
• Some algorithms are "black boxes" and can't be easily interpreted.
• E.g., neural networks.
• Many algorithms are explainable, however.
• There are several ways to explain them.
Web App
Consultant
Maturity Level – Big Data
Maturity Level – DAMA-DMBOK