Olist
A Brazilian E-commerce Company
APAN 5310 Project Team 1
Juno Zhu | Manasa Damera | Sarah Faye Wu | Yuhuan Su
Agenda
Background - Client Scenario & Data Overview
Database Normalization
ETL Process Optimization
Analytics Insights Automation - Benefits & Procedure
Dashboard Demo
Client Scenario: Powering Business Intelligence at Olist
Current Pain Points
● Scattered data storage through multiple flat files
● Inefficient information query process
● Lack of analytics insights to make business decisions
Database Normalization
● Create a normalized relational database as a central data repository to collect data
ETL Process Optimization
● Conduct data manipulation and data cleaning via Python; upload the data to a PostgreSQL database
Analytics Insights Automation
● Generate analytical insights through an interactive dashboard via Metabase
Future Impact
● Reduce data storage redundancy
● Create an efficient data query and analytics procedure
● Empower data-driven decision-making capability
Original Data Sample
Data Overview
● Data consists of 100,000 orders from 2016 through 2018 placed by customers on Olist from several sellers located across Brazil
● 9 flat CSV files: Customers, Geolocation, Order Items, Order Payments, Order Reviews, Orders, Products, Sellers and Category File
● Total size 123.4 MB
Repetitive columns that should be combined (Geolocation File, Customer File)
File size too large to be uploaded into Codio
● If we merge the geolocation dataset (61.3 MB) with the customers dataset (9 MB) to link them, the customers dataset will be over 150 MB.
● Thus, we sample these two datasets for further use.
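The sampling step could look roughly like the sketch below. The deck does not state the sampling method or fraction, so the 10% fraction, the fixed seed, the column names, and the helper names `sample_rows` and `filter_by_key` are all assumptions made for illustration. It uses only the Python standard library and toy rows in place of the real files.

```python
import random

def sample_rows(rows, fraction, seed=42):
    """Return a reproducible random sample of `rows` (a list of dicts)."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)

def filter_by_key(rows, key, allowed):
    """Keep only the rows whose `key` value appears in `allowed`."""
    return [r for r in rows if r[key] in allowed]

# Toy stand-ins for the customers and geolocation files (hypothetical columns).
customers = [
    {"customer_id": f"c{i}", "zip_code_prefix": str(1000 + i % 50)}
    for i in range(1000)
]
geolocation = [
    {"zip_code_prefix": str(1000 + i % 50), "lat": -23.5, "lng": -46.6}
    for i in range(5000)
]

# Sample customers first, then keep only the geolocation rows they can link to.
sampled_customers = sample_rows(customers, fraction=0.1)
kept_zips = {c["zip_code_prefix"] for c in sampled_customers}
sampled_geolocation = filter_by_key(geolocation, "zip_code_prefix", kept_zips)
```

Sampling one file and filtering the other by the shared key is one way to keep the two samples linkable; sampling both files independently would risk leaving customers without a matching location.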
Underlying duplicates difficult to detect
● The geolocation dataset has underlying “duplicates” that the “drop_duplicates()” function in Python cannot detect, because the same entry might be written in a different language.
● In the reviews dataset, one review_id can link to different order_id values with different information in the other columns (composite primary key: review_id, order_id).
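A minimal sketch of how both issues might be checked, using only the standard library. It assumes the hidden duplicates differ by accents, letter case, or stray whitespace, which the deck does not confirm. The sample rows and the helper names `normalize_text`, `drop_hidden_duplicates`, and `is_unique_key` are hypothetical.

```python
import unicodedata

def normalize_text(value):
    """Lowercase, trim, and strip accents so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def drop_hidden_duplicates(rows, columns):
    """Keep the first row for each normalized combination of `columns`."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(normalize_text(row[c]) for c in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def is_unique_key(rows, columns):
    """Return True when `columns` together identify every row uniquely."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys) == len(set(keys))

# Toy rows: the first two differ only by accent, case, and trailing space.
geolocation = [
    {"zip_code_prefix": "01037", "city": "sao paulo"},
    {"zip_code_prefix": "01037", "city": "São Paulo "},
    {"zip_code_prefix": "20040", "city": "rio de janeiro"},
]
# Toy rows: one review_id linked to two different orders.
reviews = [
    {"review_id": "r1", "order_id": "o1"},
    {"review_id": "r1", "order_id": "o2"},
]

deduplicated = drop_hidden_duplicates(geolocation, ["zip_code_prefix", "city"])
review_id_alone_is_key = is_unique_key(reviews, ["review_id"])
composite_is_key = is_unique_key(reviews, ["review_id", "order_id"])
```

An exact-match comparison treats the first two geolocation rows as distinct, which is why normalization has to happen before deduplication. The second check shows why review_id alone cannot serve as the primary key.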
Scattered data storage through multiple files; different languages across different files (Products File, Orders File, Order Items File)
● Information about orders, delivery, and products ordered is stored in separate files.
Normalization Plan: Creating an Optimized Data Schema
1st Normal Form
● Added primary keys such as geolocation_id to Address table
● Added foreign keys such as product_category_id to Product Category table
● Dropped duplicated data such as address data from Customers/Sellers tables
2nd Normal Form
● No changes to the tables, as all non-key attributes were already fully dependent on their primary keys
3rd Normal Form
● The Orders table was split into Orders and Delivery tables
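To make the plan concrete, here is a rough sketch of what part of such a schema might look like. It is not the team's actual schema: only geolocation_id and the Address, Customers, Orders, and Delivery tables come from the slides, while every other column name is an assumption. SQLite (via Python's built-in sqlite3 module) stands in for PostgreSQL so the sketch runs without a server.

```python
import sqlite3

# Hypothetical columns; the table split follows the normalization plan above.
SCHEMA = """
CREATE TABLE address (
    geolocation_id INTEGER PRIMARY KEY,
    zip_code_prefix TEXT NOT NULL,
    city TEXT NOT NULL,
    state TEXT NOT NULL
);
CREATE TABLE customers (
    customer_id TEXT PRIMARY KEY,
    geolocation_id INTEGER NOT NULL REFERENCES address (geolocation_id)
);
CREATE TABLE orders (
    order_id TEXT PRIMARY KEY,
    customer_id TEXT NOT NULL REFERENCES customers (customer_id),
    order_status TEXT,
    purchase_timestamp TEXT
);
CREATE TABLE delivery (
    order_id TEXT PRIMARY KEY REFERENCES orders (order_id),
    estimated_delivery_date TEXT,
    delivered_customer_date TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(SCHEMA)

conn.execute("INSERT INTO address VALUES (1, '01037', 'sao paulo', 'SP')")
conn.execute("INSERT INTO customers VALUES ('c1', 1)")

# A customer pointing at a non-existent address should be rejected.
try:
    conn.execute("INSERT INTO customers VALUES ('c2', 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

table_names = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Storing the address once and referencing it by geolocation_id removes the repeated address columns from the customers and sellers data, and the foreign key keeps the references valid.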
ETL Process:
Extract: Extracting e-commerce data from multiple CSV flat files
Transform: Performing data cleaning and manipulation on the extracted data via Python
Load: Uploading the transformed data to a centralized repository in a PostgreSQL database
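The three stages can be sketched end to end as below. This is an illustration under assumptions, not the project's code: it reads an inline CSV string instead of the real files, uses the standard csv module rather than whichever data library the team used, and loads into an in-memory SQLite database in place of PostgreSQL. The column names and the `extract`, `transform`, and `load` helpers are hypothetical.

```python
import csv
import io
import sqlite3

# Inline stand-in for one of the CSV flat files (hypothetical columns).
RAW_CSV = """customer_id,customer_zip_code_prefix,customer_city
c1,01037,sao paulo
c2,20040,rio de janeiro
"""

def extract(handle):
    """Extract: read CSV rows into dictionaries keyed by column name."""
    return list(csv.DictReader(handle))

def transform(rows):
    """Transform: select and order the columns, and tidy up text values."""
    return [
        (r["customer_id"], r["customer_zip_code_prefix"], r["customer_city"].strip().title())
        for r in rows
    ]

def load(conn, records):
    """Load: bulk-insert inside one transaction; return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id TEXT PRIMARY KEY, zip_code_prefix TEXT, city TEXT)"
    )
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

conn = sqlite3.connect(":memory:")
loaded = load(conn, transform(extract(io.StringIO(RAW_CSV))))
first_city = conn.execute(
    "SELECT city FROM customers WHERE customer_id = 'c1'"
).fetchone()[0]
```

Wrapping the insert in a single transaction means a failed load leaves the target table unchanged rather than half-populated.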
ETL - Transform Process Debrief
Step 1: Extract, rename, and reorder the columns.
Step 2: Get the relevant information for each table by merging datasets.
Step 3: Drop duplicated entries.
Step 4: Ensure that the primary key column only includes unique values and uniquely identifies each record in a table.
Step 5: Construct the "id" variable if necessary.
Step 6: If the table has foreign keys, merge the current dataset with the dataset referred to by each key to get the intersection. Drop unnecessary columns and rename columns after merging.
Step 7: Change the data types of the variables in the raw dataset to stay consistent with the column data types we designed.
Figures: New Customer Table, New Address Table
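Steps 3 to 7 might be sketched as small helper functions like the ones below. The function names, sample rows, and columns are hypothetical, and plain Python lists of dictionaries are used here in place of the data frames the team presumably worked with.

```python
def drop_duplicates(rows):
    """Step 3: drop exact duplicate rows, keeping first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def assert_primary_key(rows, key):
    """Step 4: fail loudly if `key` is missing or repeated."""
    values = [row[key] for row in rows]
    if None in values or len(values) != len(set(values)):
        raise ValueError(f"{key} does not uniquely identify each record")

def add_surrogate_id(rows, name):
    """Step 5: construct a sequential id column starting at 1."""
    return [{name: i, **row} for i, row in enumerate(rows, start=1)]

def keep_matching(rows, fk, parent_rows, pk):
    """Step 6: keep only rows whose foreign key exists in the parent table."""
    parent_keys = {p[pk] for p in parent_rows}
    return [row for row in rows if row[fk] in parent_keys]

def cast_columns(rows, types):
    """Step 7: cast raw string values to the designed column types."""
    return [{k: types.get(k, str)(v) for k, v in row.items()} for row in rows]

# Toy address rows with one exact duplicate.
addresses = add_surrogate_id(
    drop_duplicates([
        {"zip_code_prefix": "01037", "city": "sao paulo"},
        {"zip_code_prefix": "01037", "city": "sao paulo"},
        {"zip_code_prefix": "20040", "city": "rio de janeiro"},
    ]),
    "geolocation_id",
)
assert_primary_key(addresses, "geolocation_id")

# Toy customer rows; the second one references an address that does not exist.
customers = [
    {"customer_id": "c1", "geolocation_id": "1"},
    {"customer_id": "c2", "geolocation_id": "7"},
]
customers = cast_columns(customers, {"geolocation_id": int})
customers = keep_matching(customers, "geolocation_id", addresses, "geolocation_id")
```

Casting before the foreign key check matters in this sketch: the raw value "1" would not match the integer id 1, so the intersection would come back empty.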
Analytical Procedures Benefits (WHY):
Customer Insights
CMO: Understand customers’ demographic info, shopping behavior, and product preferences to build a targeted marketing strategy. Identify customers’ distribution by city, customer lifetime value, top categories, peak purchase time, and number of customers by year and month.
Seller Insights
Client Account Executive: Understand sellers’ demographic info, sales performance, and product rank to help sellers improve performance. Identify top sellers and categories with the highest growth.
Financial Insights
CFO: Analyze platform revenue and cost in real time to make quick decisions and identify potential performance issues. Understand order value and monthly and annual sales.
Operations Insights
COO: Oversee logistics performance and react promptly when significant shipment delays occur. Monitor monthly on-time delivery rate performance.
Post-Purchase Service Insights
Customer Service Executive: Review customer review metrics to ensure a high-quality closed-loop service. Analyze order review scores and customer complaints.
Empower C-level executives and analysts to understand business performance from a 360-degree view
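As an illustration of one such metric, the monthly on-time delivery rate could be computed with a query along these lines. The delivery table layout, its column names, and the sample rows are assumptions, and the query runs here against an in-memory SQLite database rather than PostgreSQL; date handling in particular would likely differ between the two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE delivery (
    order_id TEXT PRIMARY KEY,
    estimated_delivery_date TEXT,
    delivered_customer_date TEXT
);
INSERT INTO delivery VALUES
    ('o1', '2018-01-20', '2018-01-18'),
    ('o2', '2018-01-22', '2018-01-25'),
    ('o3', '2018-02-10', '2018-02-10'),
    ('o4', '2018-02-15', '2018-02-12');
""")

# Share of delivered orders that arrived on or before the estimated date,
# grouped by the month of actual delivery (ISO date strings compare correctly).
ON_TIME_RATE_BY_MONTH = """
SELECT substr(delivered_customer_date, 1, 7) AS month,
       AVG(CASE WHEN delivered_customer_date <= estimated_delivery_date
                THEN 1.0 ELSE 0.0 END) AS on_time_rate
FROM delivery
WHERE delivered_customer_date IS NOT NULL
GROUP BY month
ORDER BY month
"""

rates = conn.execute(ON_TIME_RATE_BY_MONTH).fetchall()
```

With the toy rows above, January has one late order out of two and February has none, so the rates come out as 0.5 and 1.0.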
Analytical Procedures Instructions (HOW):
Vision: C-level executives communicate the key metrics used to review each department’s performance to the analysts.
Creation: Analysts build customized metrics for the dashboard by writing queries using both Python and PostgreSQL on the Metabase platform.
Action: C-level executives review the dashboard on a daily basis to oversee business performance. Once they notice an issue, such as a drop in sales, they should ask analysts to perform further analysis and make data-driven decisions.
Implementation: Analysts should seek feedback from the executives to further improve the analytical procedure by revising the metrics.
Further Considerations
● On-premises solution for sensitive and personally identifiable customer data
● Anonymization of customer data for cloud upload
● Offsite/cloud for less sensitive data and anonymized customer data
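One possible way to prepare customer rows for cloud upload is sketched below, assuming keyed hashing of identifiers is acceptable for the use case. Strictly speaking this is pseudonymization, which may not meet every definition of anonymization. The column names, the placeholder key, and the helper names `pseudonymize` and `prepare_for_cloud` are hypothetical.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def prepare_for_cloud(rows, id_columns, drop_columns, secret_key):
    """Hash identifier columns and drop directly identifying columns."""
    cleaned = []
    for row in rows:
        out = {k: v for k, v in row.items() if k not in drop_columns}
        for col in id_columns:
            out[col] = pseudonymize(out[col], secret_key)
        cleaned.append(out)
    return cleaned

# Hypothetical placeholder; a real key would stay on premises and never be uploaded.
SECRET_KEY = b"replace-with-a-key-kept-on-premises"

customers = [{"customer_id": "c1", "city": "sao paulo", "zip_code_prefix": "01037"}]
cloud_rows = prepare_for_cloud(
    customers,
    id_columns=["customer_id"],
    drop_columns={"zip_code_prefix"},
    secret_key=SECRET_KEY,
)
```

Because the same input always maps to the same digest under one key, joins across tables still work on the hashed ids, while anyone without the key cannot recompute them from known customer ids.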
Database Interaction Demo
http://35.237.178.81:3000/dashboard/1
Thank you!
Q&A
References
Data Sources:
1. Kaggle (Brazilian E-Commerce Public Dataset by Olist), https://www.kaggle.com/olistbr/brazilian-ecommerce/home
2. Silberschatz, A., Korth, H. F., and Sudarshan, S. (2011). Database System Concepts (6th Edition). McGraw-Hill. ISBN-13: 978-0073523323
Code - Data sampling [Link]
Code - Create database & Extract, Transform, Load in Python [Link]

Team project - Data visualization on Olist company data
