Understanding the life cycle of a data analysis project
&
Visualizing your data with Tableau
Paula Muñoz
Telecommunications Professional | Data Enthusiast | Tableau Public Featured Author
@paulisDataViz | munoz.p@husky.neu.edu
1. Career path and education
2. Life cycle of a data analysis project based on the CRISP-DM methodology
   Walk through the steps with a sample project and tools
3. Visualizing your data with Tableau
4. Q & A
My Career
Lifelong Learner
References/ Images
Diagram by Kenneth Jensen - Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=24930610
Reference: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
Cross-Industry Standard Process for Data Mining (CRISP-DM)
"A data mining process model that describes commonly used approaches that data mining experts use to tackle problems" - Wikipedia
[Diagram: CRISP-DM process, with a box labeled "Visualization & Presentation"]
Business Issue Understanding
• Business objectives
• Information needed
• Type of analysis
• Scope of work
• Deliverables
“The data scientists at BigMart have collected 2013 sales data for
1559 products across 10 stores in different cities. Also, certain
attributes of each product and store have been defined. The aim is
to build a predictive model and find out the sales of each product
at a particular store.
Using this model, BigMart will try to understand the properties of
products and stores which play a key role in increasing sales.”
Case Study: Big Mart Sales Practice Problem
Source: Analytics Vidhya; solution approach by Aarshay Jain
Let’s practice with the Big Mart Sales Dataset!
Data Understanding
• Initial data collection
• Data requirements
• Data availability
• Data exploration and characteristics
Variable                    Description
Item_Identifier             Unique product ID
Item_Weight                 Weight of product
Item_Fat_Content            Whether the product is low fat or not
Item_Visibility             The % of total display area of all products in a store allocated to the particular product
Item_Type                   The category to which the product belongs
Item_MRP                    Maximum Retail Price (list price) of the product
Outlet_Identifier           Unique store ID
Outlet_Establishment_Year   The year in which the store was established
Outlet_Size                 The size of the store in terms of ground area covered
Outlet_Location_Type        The type of city in which the store is located
Outlet_Type                 Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales           Sales of the product in the particular store. This is the outcome variable to be predicted.
Per case study:
“We have train (8523) and test (5681) data set, train data set has both input and output variable(s).
You need to predict the sales for test data set.”
12 variables in total; Item_Outlet_Sales is the dependent variable.
Data Exploration and data characteristics
Tools to explore and prepare data:
Excel (for small datasets), Python, R, Alteryx, Tableau Prep, Tableau Desktop
Tools shown: Python, Tableau Prep, Tableau Desktop
Tableau Prep:
Data Preparation
• Gather data from multiple sources
• Cleanse
• Format
• Blend
• Sample
Combining both “train” and “test” data into one dataset and adding a column to identify the source
13 fields, 14204 rows
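For readers following along in Python rather than Tableau Prep, the combining step could be sketched with pandas as below. The tiny data frames and their values are made-up stand-ins for the real files, and the column name `source` is an assumption.

```python
import pandas as pd

# Tiny stand-ins for the real train/test files (values are made up).
train = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Outlet_Identifier": ["OUT049", "OUT018"],
    "Item_Outlet_Sales": [3735.14, 443.42],
})
test = pd.DataFrame({
    "Item_Identifier": ["FDW58"],
    "Outlet_Identifier": ["OUT049"],
})

# Tag each row with its origin so the two sets can be split apart again later.
train["source"] = "train"
test["source"] = "test"

# Stack the rows; test rows get NaN for the target column they do not have.
combined = pd.concat([train, test], ignore_index=True)
print(combined.shape)
```

On the real files the same pattern should give the 13 fields (12 variables plus the source column) and 14204 rows (8523 + 5681) noted above.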
Identifying missing values
OK: the missing values in the target variable (Item_Outlet_Sales) come from the test rows, which have no sales values.
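A minimal pandas sketch of the same missing-value check, assuming a combined data set with a `source` column (an assumed name); the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample of the combined data set.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "High"],
    "Item_Outlet_Sales": [3735.14, 443.42, np.nan, np.nan],
    "source": ["train", "train", "test", "test"],
})

# Number of missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Confirm that the missing target values all come from the test rows.
missing_target_by_source = df["Item_Outlet_Sales"].isna().groupby(df["source"]).sum()
print(missing_target_by_source)
```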
Exploring Categorical variables (Dimensions)
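In Python, one way to explore the categorical variables is to count how often each category appears. This is only a sketch on made-up sample rows; the inconsistent spellings are included on purpose to show how they surface.

```python
import pandas as pd

# Made-up sample rows with inconsistent category spellings and missing sizes.
df = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "Regular", "LF", "low fat", "reg", "Low Fat"],
    "Outlet_Size": ["Medium", "Small", None, "Medium", "High", None],
})

# Frequency of each category, keeping missing values visible.
counts = {}
for col in ["Item_Fat_Content", "Outlet_Size"]:
    counts[col] = df[col].value_counts(dropna=False)
    print(counts[col])
```

Five distinct labels for what should be two fat-content categories is the kind of problem this view makes easy to spot.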
Exploring statistics for numerical variables (Measures)
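The summary statistics for the numerical variables could be reproduced in pandas roughly as follows; the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample rows for three numerical variables.
df = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5, 19.2],
    "Item_Visibility": [0.016, 0.0, 0.019, 0.0],
    "Item_MRP": [249.81, 48.27, 141.62, 182.10],
})

# Count, mean, standard deviation, min, quartiles and max per column.
summary = df.describe()
print(summary)

# A minimum of 0 for Item_Visibility is suspicious, so count those rows.
zero_visibility = int((df["Item_Visibility"] == 0).sum())
print(zero_visibility)
```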
Summary of inferences drawn from the variables in Step 2 (Data Understanding):
• Item_Weight and Outlet_Size have a high number of nulls (missing values).
• Item_Visibility has some minimum values of 0, which doesn't make sense because every item occupies some space in a store.
• Item_Fat_Content has typos and duplicate category names.
With an understanding of the data in hand, we apply the following steps to clean it:
1. Impute missing values:
• Impute Outlet Size with the mode based on Outlet Type
• Impute Item Weight with the average weight based on Product/Item Identifier
• Impute Item Visibility values of 0 with the average based on Product/Item Identifier
2. Remove duplicates:
• Modify the Item Fat Content variable to remove duplicate categories
3. Create additional variables:
• Create a Broad Type category based on the first two letters of Item Identifier
• Create an Outlet_Year variable to determine the number of years an outlet has been in operation
Imputing missing values
Impute Outlet Size with the mode based on Outlet Type (before/after comparison)
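A pandas sketch of this imputation, assuming every outlet type has at least one known size; the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows: one missing size per outlet type.
df = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", "Small", None, "Medium", None, "Medium"],
})

# Most frequent size within each outlet type (mode() ignores missing values).
# Note: this fails if an outlet type has no known size at all.
mode_by_type = df.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])

# Fill only the missing sizes, using the mode of the row's outlet type.
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Type"].map(mode_by_type))
print(df)
```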
Cleaning the data: Imputing other values
Impute Item Visibility values of 0 with the average based on Product/Item Identifier
Impute Item weight with average weight based on Product/ Item Identifier
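Both imputations can be sketched in pandas with a per-item average. One assumption to flag: here the zeros are excluded when the visibility average is computed, which is a design choice of this sketch and not necessarily what was done in the original workflow. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample rows with missing weights and zero visibility.
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "FDA15", "DRC01", "DRC01"],
    "Item_Weight": [9.3, np.nan, 9.3, 5.92, np.nan],
    "Item_Visibility": [0.02, 0.04, 0.0, 0.0, 0.03],
})

# Item_Weight: fill nulls with the average weight of the same item.
item_mean_weight = df.groupby("Item_Identifier")["Item_Weight"].transform("mean")
df["Item_Weight"] = df["Item_Weight"].fillna(item_mean_weight)

# Item_Visibility: treat 0 as missing, then fill with the item's average
# (assumption: zeros are left out of that average).
visibility = df["Item_Visibility"].replace(0, np.nan)
item_mean_visibility = visibility.groupby(df["Item_Identifier"]).transform("mean")
df["Item_Visibility"] = visibility.fillna(item_mean_visibility)

# Validate that nothing is left to impute.
print(df["Item_Weight"].isna().sum(), (df["Item_Visibility"] == 0).sum())
```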
Validating that values have been imputed as expected
• Item_Visibility: values of 0 replaced with the average
• Item_Weight: null values replaced with the average
Removing duplicates
Modify the Item Fat Content variable, since the same categories are listed under different names (before/after comparison)
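In pandas, this clean-up amounts to mapping each variant spelling to one canonical label. The specific variant spellings below are assumptions for illustration; check them against the value counts of the real column.

```python
import pandas as pd

# Made-up sample showing the same two categories under different names.
df = pd.DataFrame({"Item_Fat_Content": ["Low Fat", "LF", "low fat", "Regular", "reg"]})

# Map each variant spelling to a single canonical label (assumed variants).
fat_content_fixes = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(fat_content_fixes)

print(df["Item_Fat_Content"].value_counts())
```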
Adding Extra Columns
Create a Broad Type category based on the first two letters of Item Identifier
Modify Item Fat Content again to account for non-consumable items
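A sketch of both steps in pandas. The prefix meanings (FD, DR, NC), the new column name `Item_Type_Broad`, and the label "Non-Edible" are assumptions made for this example.

```python
import pandas as pd

# Made-up sample rows, one per assumed identifier prefix.
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
})

# Broad type from the first two letters of the identifier (assumed mapping).
prefix_to_type = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}
df["Item_Type_Broad"] = df["Item_Identifier"].str[:2].map(prefix_to_type)

# Fat content is not meaningful for non-consumable items, so relabel it.
is_non_consumable = df["Item_Type_Broad"] == "Non-Consumable"
df.loc[is_non_consumable, "Item_Fat_Content"] = "Non-Edible"
print(df)
```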
Adding Extra Columns
Create an Outlet_Year variable to determine the number of years an outlet has been in operation
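In pandas this is a single subtraction. The reference year of 2013 is an assumption based on the case study's statement that the data is 2013 sales data; the column name `Outlet_Years` and the sample years are also made up.

```python
import pandas as pd

# Assumed reference year: the case study describes 2013 sales data.
REFERENCE_YEAR = 2013

# Made-up establishment years.
df = pd.DataFrame({"Outlet_Establishment_Year": [1999, 2009, 1985]})

# Years in operation as of the reference year.
df["Outlet_Years"] = REFERENCE_YEAR - df["Outlet_Establishment_Year"]
print(df)
```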
Since we know that we will be building a predictive model, we need to prepare our data a little further. We need to:
• Encode categorical variables as numeric values and apply one-hot encoding
• Split the data back into train and test data sets
• Drop unnecessary columns
• Export and save a copy of the modified data sets
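These final preparation steps could look roughly like this in pandas. The `source` column, the sample values, and the output file names in the comments are assumptions.

```python
import pandas as pd

# Made-up sample of the combined, cleaned data set.
combined = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19", "FDW58"],
    "Outlet_Size": ["Medium", "Small", "Medium", "High"],
    "Item_MRP": [249.81, 48.27, 53.86, 107.86],
    "Item_Outlet_Sales": [3735.14, 443.42, 994.71, None],
    "source": ["train", "train", "train", "test"],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(combined, columns=["Outlet_Size"], dtype=int)

# Split back into train and test, dropping columns that are no longer needed.
train = encoded[encoded["source"] == "train"].drop(columns=["source"])
test = encoded[encoded["source"] == "test"].drop(columns=["source", "Item_Outlet_Sales"])

# Export step (hypothetical file names), left commented out in this sketch:
# train.to_csv("train_modified.csv", index=False)
# test.to_csv("test_modified.csv", index=False)
print(train.columns.tolist())
```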
Exploratory Analysis/Modeling
• Develop methodology
• Determine important variables
• Build model
• Assess model
Predictive modeling using machine learning algorithms:
• Multiple Linear Regression Model: Linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). - Wikipedia
• Decision Tree Model: Decision tree models allow you to develop classification systems that predict or classify future observations based on a set of decision rules. - IBM Knowledge Center
• Random Forest Model: Builds multiple decision trees and merges them together to get a more accurate and stable prediction. - Towardsdatascience.com
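To make the comparison concrete, here is a minimal scikit-learn sketch that fits the three model types on synthetic data and compares their RMSE. It is not the workflow used in the presentation: the data is randomly generated, and the hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: three features, one numeric target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * X[:, 0] + 50 * (X[:, 1] > 0.5) + rng.normal(0, 5, size=300)

# Hold out part of the data to assess each model.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its RMSE on the held-out data.
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(y_val, predictions)))

print(rmse_by_model)
```

A lower RMSE on the held-out data means the predictions are closer to the actual values, which is how the models are compared in the validation step.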
Validation
Evaluating Model results
• Evaluate results
• Review process
• Determine next steps:
  - OK: go to Step 6
  - NOT OK: repeat
• Linear Regression Model: low R-squared (0.56) and high RMSE (1128)
• Decision Tree Models: RMSE slightly improved (1091); the model starts to show which variables play a key role in predicting sales
• Random Forest Models: improved over the Decision Tree Model, with an RMSE of 1068
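For reference, the two metrics quoted above can be computed by hand. The sketch below uses made-up actual and predicted values, not the Big Mart results.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values: every prediction is off by exactly 10.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(rmse(actual, predicted))                 # 10.0
print(round(r_squared(actual, predicted), 3))  # 0.992
```

RMSE is expressed in the same units as the target (sales), so lower is better, while R-squared closer to 1 means more of the variation in sales is explained.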
Tool shown: Orange
Visualization & Presentation
• Communicate results
• Determine best method/graph to present insights based on analysis and audience
• Craft a compelling story
• Make recommendations
Communicate results
• Share the model results by generating a report with results and predictions.
• If the model is meant to be used repeatedly, create documentation and train the user(s) in how to use the model.
Determine best method/graph
• The right chart type by Dr. Andrew Abela
• Other resources:
  - Selecting the Right Chart type by Stephen Few
  - Graph Selection Matrix by Stephen Few
Craft a compelling story and make recommendations
• Can be done via a dashboard
1. Business Issue Understanding
• Business objectives
• Information needed
• Type of analysis
• Scope of work
• Deliverables
2. Data Understanding
• Initial data collection
• Data requirements
• Data availability
• Data exploration and characteristics
3. Data Preparation
• Gather data from multiple sources
• Cleanse
• Format
• Blend
• Sample
4. Exploratory Analysis/Modeling
• Develop methodology
• Determine important variables
• Build model
• Assess model
5. Validation
• Evaluate results
• Review process
• Determine next steps: OK or NOT OK
6. Visualization & Presentation
• Communicate results
• Determine best method/graph to present insights based on analysis and audience
• Craft a compelling story
• Make recommendations
Reference: Problem Solving with advanced analytics
Reference: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
Getting noticed in the Tableau Community
• MakeoverMonday Blog: Dec 3rd, 2017
• Tableau Public Viz of the Day: Dec 5th, 2017
• Tableau Public Featured Author: March 28th, 2018
Tableau Public Profile Tour
Tips to learn more about Tableau and Data Visualization
• Create a Tableau Public Account
• Create a Twitter Account, contribute and participate in:
• #MakeoverMonday
• #workoutwednesday
• #SWDchallenge
• #DataForACause
• #VizforSocialGood
• Follow:
• @tableau
• @tableaupublic
• @paulisDataViz
• @leveleducation
Books I recommend:
• The Big Book of Dashboards
• Storytelling with Data
Questions?
@paulisDataViz | munoz.p@husky.neu.edu


Editor's Notes

  • #7 Reviewing initial data collected and information provided
  • #11 Identifying missing values