4. Ten Ways Data is Like Water | How to Leverage Data
• Companies cannot survive without data
• Companies can drown in too much data
• You can be surrounded by data that you can't use
• Data flows everywhere
• Data gets dirty and stale if left unattended
• Expensive data may not be better data
• Packaging matters
• Data management is a long-term project
• Data quality should be fit for purpose
• Clean data at the source
5. Business Drivers
1) The digitization of society;
2) The plummeting of technology costs;
3) Connectivity through cloud computing;
4) Increased knowledge about data science;
5) Social media applications;
6) The upcoming Internet-of-Things (IoT).
6. The digitization of society
• Big Data is largely consumer driven and consumer oriented.
• Most of the data in the world is generated by consumers, who are nowadays 'always-on'.
• Most people now spend 4-6 hours per day consuming and generating data through a variety of devices and (social) applications.
• With every click, swipe or message, new data is created in a database somewhere around the world. Because everyone now has a smartphone in their pocket, the data creation sums to incomprehensible amounts.
• Some studies estimate that 60% of data was generated within the last two years, which is a good indication of the rate at which society has digitized.
7. The plummeting of technology costs
• The costs of data storage and processors keep declining, making it possible for small businesses and individuals to become involved with Big Data.
• For storage capacity, the often-cited Moore's Law still holds: storage density (and therefore capacity) doubles roughly every two years.
8. Connectivity through cloud computing
• Cloud computing environments (where data is remotely stored in distributed storage systems) have made it possible to quickly scale IT infrastructure up or down and facilitate a pay-as-you-go model.
• This means that organizations that want to process massive quantities of data (and thus have large storage and processing requirements) do not have to invest in large quantities of IT infrastructure.
• Instead, they can license the storage and processing capacity they need and only pay for the amounts they actually use.
9. Increased knowledge about data science
• In the last decade, the terms data science and data scientist have become tremendously popular. In October 2012, Harvard Business Review called data scientist the "sexiest job of the 21st century", and many other publications have featured this new job role in recent years.
• The demand for data scientists (and similar job titles) has increased tremendously, and many people have actively become engaged in the domain of data science.
11. Social media applications
• Social media data provides insights into the behaviors, preferences and opinions of 'the public' on a scale that has never been known before.
• Because of this, it is immensely valuable to anyone who is able to derive meaning from these large quantities of data.
• Social media data can be used to identify customer preferences for product development, target new customers for future purchases, or even target potential voters in elections.
12. The upcoming Internet of Things (IoT)
• The Internet of Things (IoT) is the network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and network connectivity, which enables these objects to connect and exchange data.
13. Data is valuable
• Data is an asset with unique properties.
• The value of data can and should be expressed in economic terms.
(DAMA-DMBOK)
14. Data and Information
• The Strategic Alignment Model (Henderson and Venkatraman, 1999) abstracts the fundamental drivers for any approach to data management.
• At its center is the relationship between data and information.
• Information is most often associated with business strategy and the operational use of data.
15. Data Governance and Data Management
• Data Governance (DG) is defined as the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.
• All organizations make decisions about data, regardless of whether they have a formal Data Governance function.
• The Data Governance function guides all other data management functions.
• The purpose of Data Governance is to ensure that data is managed properly, according to policies and best practices (Ladley, 2012).
Data Management Requirements are Business Requirements
• Managing data means managing the quality of data
• It takes metadata to manage data
• It takes planning to manage data
• Data Management requirements must drive Information Technology decisions
Data Management depends on diverse skills
• Data Management is cross-functional
• Data management requires an enterprise perspective
• Data management must account for a range of perspectives
Data Management is lifecycle management
• Different types of data have different lifecycle characteristics
• Managing data includes managing the risks associated with data
19. Database administrator
A database administrator implements and manages the operational aspects of cloud-native and hybrid data platform solutions that are built on Database Server.
21. Business Analyst
While some similarities exist between a data analyst and a business analyst, the key differentiator between the two roles is what they do with data.
1. A business analyst is closer to the business and is a specialist in interpreting the data that comes from the visualization.
2. Often, the roles of data analyst and business analyst could be the responsibility of a single person.
22. Data analyst
A data analyst enables businesses to maximize the value of their data assets through visualization and reporting tools such as Microsoft Power BI.
Data analysts work with data engineers to determine and locate appropriate data sources that meet stakeholder requirements.
25. Data engineer
1. Data engineers provision and set up data platform technologies that are on-premises and in the cloud.
2. They manage and secure the flow of structured and unstructured data from multiple sources.
3. The data platforms that they use can include relational databases, nonrelational databases, data streams, and file stores.
4. Data engineers also ensure that data services securely and seamlessly integrate across data platforms.
27. Data scientist
1. Data scientists perform advanced analytics to extract value from data.
2. Their work can vary from descriptive analytics to predictive analytics.
3. Descriptive analytics evaluate data through a process known as exploratory data analysis (EDA).
4. Predictive analytics are used in machine learning to apply modeling techniques that can detect anomalies or patterns.
5. These analytics are important parts of forecast models.
29. Machine Learning and Data Science
[Diagram: the overlapping fields of Artificial Intelligence, Machine Learning, Deep Learning, and Data Science]
30. The Data Science Process
Frame the Problem → Identify & Collect Data → Process Data → Analyze Data → Train Models → Finalize the Project
31. Problem Formulation
• "Understandable": can be defined in terms of business needs.
• "Actionable": can offer high-level direction as to how to approach a solution.
• Frame the problem.
  • Description of problem, written clearly so it can be handed off to others.
• Identify why the problem must be solved.
  • Rationale
  • Benefits
  • Lifetime and use
• Provide background information.
  • Assumptions (e.g., acceptable data, operating requirements, business contexts, etc.).
  • Reference problems (i.e., similar problems you've solved before).
• Determine whether the problem is appropriate for data science.
  • Some problems are more easily solved using traditional methods.
  • Data science can be difficult and expensive.
  • Data science must be justified as the optimal approach.
Problem formulation: The process of identifying an issue that should be addressed, and putting that issue in terms that are understandable and actionable.
32. Identify & Collect Data
Frame the Problem → [Identify & Collect Data] → Process Data → Analyze Data → Train Models → Finalize the Project
33. Datasets
• Data does not always start out in a neatly packaged form.
  • Individual pieces might span multiple repositories.
  • Data might be mixed in with irrelevant or dissimilar data.
  • You'll need to place data into one or more sets.
• Example: Data repository with information about salespeople.
  • Data describing the same ideas is already in one place.
  • May be considered a dataset.
• Example: Database A has customer demographics, B has actual transaction info.
  • Data is spread out and takes different forms.
  • Will be difficult to work with as is.
  • Must be placed into one or more sets.
• Datasets can include any kind of data that's relevant to your goals.
  • Might be unique to your industry/organization.
Dataset: A collection of data that can be used to accomplish business goals.
34. Structure of Data
• Structured:
  • Facilitates searching, filtering, or extracting data.
  • E.g., spreadsheet or database.
  • Chunks of data can be retrieved using a programming or querying language.
• Unstructured:
  • Not easy to query.
  • E.g., images, video, textual contents, etc.
  • Usually a larger proportion of data than structured data.
• Semi-structured:
  • Aspects of both structured and unstructured.
  • E.g., email content is unstructured, but email fields are structured.
  • Some formats (like XML and JSON documents) can be in different forms.
    • Server log output could be structured.
    • Human-authored documents may not be structured.
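As a small illustration of semi-structured data, the sketch below parses a server log line whose JSON envelope is structured while its message field holds free text. The log format and field names here are invented for illustration, not taken from any real system:

```python
import json

# A semi-structured server log line (hypothetical format): the JSON
# fields are structured, but "message" holds free, unstructured text.
log_line = ('{"timestamp": "2024-05-01T12:00:00Z", '
            '"level": "ERROR", '
            '"message": "disk 3 reported 12 bad sectors"}')

record = json.loads(log_line)   # parse the structured envelope

print(record["level"])          # structured field -> easy to query
print(record["message"])        # free text -> needs further processing
```

The structured fields can be filtered with ordinary code or a query language; extracting meaning from the message text would require additional text processing.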
36. Preliminary Data Transformation
• Transformation comes after extraction in ETL (extract, transform, load).
  • Involves changing data in some way.
• You can make changes early.
  • You know what to look for.
  • Your tools can help you find issues.
• However, you won't be able to make some changes until after analysis.
  • Analysis tells you what to transform and how.
• Changes at this point are preliminary.
  • You can still get quite a bit done.
37. Data Preparation and Cleaning
• Data preparation alters data so it more effectively supports data science tasks.
  • Tasks like analysis and modeling.
  • Necessary for achieving business goals.
  • Comprises multiple individual tasks.
  • Purpose is to identify issues before data is loaded into its destination.
  • Issues can be at a macro level or micro level.
• Data cleaning addresses inaccuracies and other problems with data.
  • A subset of preparation.
  • Duplicated data, poorly formatted data, corrupt data, etc.
  • You can correct data or remove it.
  • Choice of action depends on feasibility and impact on later processes.
• Data wrangling/munging are alternative terms.
  • Often refer to manual work.
  • Preparation can be automated so cleanup can repeat.
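A minimal sketch of these cleaning steps using pandas, on an invented toy table (the column names and values are illustrative, not from the slides). It handles the three issue types named above: duplicated data, poorly formatted data, and corrupt data:

```python
import pandas as pd

# Toy customer table with typical quality issues (illustrative data):
# a duplicated row, inconsistent name formatting, and a corrupt amount.
df = pd.DataFrame({
    "name":   [" alice ", "BOB", "BOB", "carol"],
    "amount": ["10.5", "7", "7", "n/a"],
})

df = df.drop_duplicates()                        # duplicated data
df["name"] = df["name"].str.strip().str.title()  # poorly formatted data
df["amount"] = pd.to_numeric(df["amount"],
                             errors="coerce")    # corrupt data -> NaN
df = df.dropna(subset=["amount"])                # remove what can't be repaired

print(df)
```

Because the steps are plain code rather than manual edits, the cleanup can be rerun automatically whenever new data arrives, as the last bullet suggests.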
39. Exploratory Data Analysis
• Purpose is to maximize insights gleaned from the data.
• Objectives:
  • Test and evaluate prior assumptions.
  • Reveal underlying data structure.
  • Determine important features/factors.
  • Identify unwanted elements.
  • Determine best path forward.
• EDA is flexible and often employs visualizations.
  • Enables more open-ended exploration.
  • Incorporates plots of raw data and summarized data.
  • Arranging multiple plots can make it easier to recognize patterns.
• EDA is valuable at every step of the process.
  • Changing the data or applying it could prompt EDA.
Exploratory data analysis: A data science approach to closely examining data in order to reveal new information.
40. Dataset Content and Format
• Start by getting familiar with content and format of data.
• Try to identify:
  • Number of columns
  • Names of columns
  • Data types of columns
  • Number of rows
  • Primary row identifiers
  • Value representation
  • Presence/number of missing values
• Use Python DataFrame functions:
  • info() to get attributes.
  • head() to get first few rows.
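A short sketch of this first look using pandas. The dataset here is invented for illustration; in practice you would load yours with something like `pd.read_csv()`:

```python
import pandas as pd

# Small illustrative dataset with one missing value.
df = pd.DataFrame({
    "id":    [1, 2, 3, 4],
    "city":  ["Oslo", "Lima", None, "Pune"],
    "sales": [250.0, 310.5, 120.0, 90.25],
})

df.info()               # column names, dtypes, non-null counts, row count
print(df.head(2))       # first few rows, to check value representation
print(df.isna().sum())  # number of missing values per column
```

Together these calls answer most of the checklist above: column count and names, data types, row count, and the presence of missing values.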
41. Correlation Coefficient
• Produces a value between +1 (positive correlation) and −1 (negative correlation).
• Shows the strength of the variables' dependence on each other.
Pearson correlation coefficient: A measurement of the linear correlation between two variables, commonly called x and y.
[Figure: three scatter plots showing positive correlation (r = 0.6), no correlation (r = 0), and negative correlation (r = −0.8)]
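The coefficient can be computed directly from its definition (covariance divided by the product of standard deviations). This pure-Python sketch, with a helper function name of our own choosing, illustrates the two extremes of the range:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ +1.0 (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ −1.0 (perfect negative)
```

Intermediate values like the r = 0.6 and r = −0.8 in the figure indicate weaker linear relationships; r = 0 means no linear dependence at all.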
58. Heatmaps
[Figure: heatmap superimposed on a geographical map; areas with fewer houses are shown in purple, areas with more houses in green]
[Figure: correlation matrix shown as a heatmap; data pairs with lower correlation appear in lighter shades, those with higher correlation in darker shades]
59. You Are Here (Process, Analyze, Train Models)
Frame the Problem → Identify & Collect Data → [Process Data → Analyze Data → Train Models] → Finalize the Project
61. The Bias–Variance Tradeoff
• High bias:
  • May underfit the training set
  • More simplistic
  • Less likely to be influenced by true relationships between features and target outputs
• The sweet spot:
  • Good enough fit on training datasets
  • Just complex enough
  • Skillful in finding true relationships between features and target outputs while not overly influenced by noise
• High variance:
  • May overfit the training set
  • More complex
  • More likely to be influenced by false relationships between features and target outputs ("noise")
[Figure: error plotted against model complexity, with the sweet spot at the minimum of the error curve]
71. Naïve Bayes
Bayes' theorem, used by naïve Bayes classifiers for class probability estimation:

p(y|x) = p(x|y) p(y) / p(x)

Where:
• y is the observed classification.
• x is the vector of dataset features.
• p(y|x) is the likelihood of y given x (posterior probability).
• p(x|y) is the likelihood of x given y.
• p(y) is the probability of y independent of the data (prior probability).
• p(x) is the probability of x independent of the data.
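A quick numeric sketch of the theorem. The probabilities below are invented for illustration (a hypothetical drive-failure alert), not taken from the slides:

```python
# Bayes' theorem with illustrative numbers:
# y = "drive fails", x = "sensor raises an alert".
p_y = 0.01              # prior: 1% of drives fail
p_x_given_y = 0.90      # alert fires for 90% of failing drives
p_x_given_not_y = 0.10  # false-alert rate on healthy drives

# p(x) via the law of total probability.
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Posterior: p(y|x) = p(x|y) p(y) / p(x)
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 4))  # 0.0833
```

Even with a sensitive alert, the low prior keeps the posterior at roughly 8%: an alert raises suspicion but does not by itself mean the drive is likely to fail.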
79. Accuracy
• Drive failure model accuracy: ~98%.
• Intuitive, but often unreliable.
  • You can have a high accuracy even if the model doesn't excel at its purpose.
  • Only really suitable in balanced datasets.
Accuracy = correct estimations / all estimations = (TP + TN) / (TP + TN + FP + FN)
80. Precision
• Drive failure model precision: 68%.
• More useful than accuracy in unbalanced datasets.
• Doesn't account for false negatives.
  • Just one drive malfunctioning is undesirable.
  • You could set your tolerance for false negatives higher, but precision will still come up short.
Precision = correct positive estimations / all positive estimations = TP / (TP + FP)
81. Recall
• Drive failure model recall: 81%.
• Minimizes false negatives.
  • You could predict all drives will fail, making recall 100%, but the model would be useless.
• Not as good as precision at minimizing false positives.
Recall = correct positive estimations / all relevant instances = TP / (TP + FN)
83. F₁ Score
• Precision and recall are more useful in unbalanced datasets.
  • They come with a tradeoff.
• Not always clear which metric is more useful.
  • A false positive may be just as undesirable as a false negative.
• F₁ score helps you find the optimal combination of both precision and recall.
F₁ = 2 · (precision · recall) / (precision + recall)
Example: F₁ = 2 · (0.87 · 0.79) / (0.87 + 0.79) ≈ 0.83, an F₁ score of around 83%.
84. Specificity
• Drive failure model specificity: 98%.
• Maximizes true negatives.
• Not useful in all cases, especially with imbalanced datasets.
• Customer attrition scenario based on satisfaction with a new product:
  • Satisfaction is positive, lack of satisfaction is negative.
  • Responses are balanced.
  • Maximize true negatives to reduce attrition from unsatisfied customers.
  • Might be a good case for specificity.
Specificity = correct negative estimations / all actual negatives = TN / (TN + FP)
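The formulas on the preceding slides can be collected into one helper. The confusion-matrix counts below are invented so that the resulting metrics land near the slides' drive-failure values (roughly 97-98% accuracy, 68% precision, 81% recall, 98% specificity):

```python
def classification_metrics(tp, tn, fp, fn):
    """Common evaluation metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   precision,
        "recall":      recall,
        "f1":          2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# Illustrative confusion matrix for a drive-failure model.
m = classification_metrics(tp=81, tn=1862, fp=38, fn=19)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Note how the heavy class imbalance (1,900 healthy drives vs. 100 failing ones) lets accuracy and specificity look excellent while precision stays modest, which is exactly why the slides recommend precision and recall for unbalanced data.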
85. Receiver Operating Characteristic (ROC) Curve
[Figure: ROC curves for Model A and Model B, plotting true positive rate (0.0-1.0) against false positive rate (0.0-1.0), with the diagonal line representing a random guess]
False Positive Rate = FP / (FP + TN)
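One point on an ROC curve comes from fixing a classification threshold and counting outcomes; sweeping the threshold traces the full curve. This pure-Python sketch (scores and labels invented for illustration) applies the false-positive-rate formula above together with the true positive rate (recall):

```python
def roc_point(scores, labels, threshold):
    """(FPR, TPR) for one threshold: a single point on the ROC curve."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Illustrative model scores and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

print(roc_point(scores, labels, 0.5))  # one (FPR, TPR) point
```

At threshold 0.0 every example is predicted positive, giving the curve's top-right corner (1.0, 1.0); raising the threshold moves the point toward the bottom-left.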
87. Know Your Audience
87
• Your audience might be:
• Just you, no reporting needed
• A single person
• A small group of stakeholders
• An entire organization
• You may need to adjust your reporting for:
• Different knowledge
• Different needs
• Different expectations
88. Derive Insights from Findings
• Findings must be translated into business insights to demonstrate their value.
• Begin by reviewing the overall process and results.
• Ask yourself:
  • What did I know before the project?
  • What do I know now?
  • How does my analysis supplement my knowledge?
  • How do the models I built address issues?
  • How do the results align with KPIs?
  • What business actions can be taken?
  • How can I improve the data science process in the future?
• Ensure insights are both relevant and in context.
  • E.g., customers care less about insights into increasing profits than insights into improving the user experience.
• Ensure insights are clear and precise.
  • E.g., a classifier "is 95% accurate and will save 20 work hours in a week as compared to current manual review."
89. Explainability
• Explainability/interpretability is one factor that drives your conclusions.
  • An explainable process is one whose inner workings are identifiable and can be communicated.
• Often, you must be able to explain why a model produced a result.
  • Proves the model's skill.
  • Makes decisions more defensible.
  • Allays concerns people have about automation.
• Some algorithms are "black boxes" and can't be easily interpreted.
  • E.g., neural networks.
• Many algorithms are explainable, however.
  • There are several ways to explain them.