Build Data Warehouse for Retail using Hadoop Ecosystem
Let Data tell you its stories
Virtual Assistant for Retail Business
What happened?
Store A sales decreased 15% vs.
the same day last year.
Why did it happen?
Sales in a key department decreased
20% because key products are out of stock.
What will happen?
Suppliers can’t deliver incoming orders of
these key products this week. Sales are
expected to continue decreasing by 18% next week.
What should I do?
Find and purchase alternative products for the
store. I have emailed you the details.
Awesome! Thank you.
What is a Data Warehouse?
❏ A Data Warehouse (DW) is a relational
database that is designed for query and
analysis rather than transaction processing.
❏ It includes historical data derived from
transaction data from one or more
sources.
❏ A Data Warehouse is a subject-oriented,
integrated, and time-variant store of
information in support of management's
decisions.
Data Warehouse Models
❏ Basic data warehouse architecture
❏ Data warehouse architecture with a staging area
❏ Data warehouse architecture with a staging area and
data marts. This is the most common data warehouse
architecture.
Vision of Data Warehouse in Retail Assistant
❏ 10,000 merchants (~100 TB)
❏ Real-time conversation with data
❏ Problems: performance, cost, scalability
Technology Solution
❏ Apache Hive
❏ Spark SQL
❏ Amazon Redshift
❏ Google BigQuery
Implement Hadoop for Retail Assistant
Infrastructure on AWS
6 Steps to Build a Data Warehouse
❏ Understand the problem we need to solve
❏ Define data sources and data marts
❏ Design the data models for staging, the data warehouse, and OLAP
❏ Build the data pipeline
❏ Data validation
❏ System monitoring
Understand the problem
Metrics
❏ Gross Sales
❏ Net Sales
❏ Profit
❏ COGS
❏ Margin
❏ Sold Qty
❏ Transactions
❏ Sales per Square Meter
❏ Avg Basket Value
❏ Avg Item Value
❏ Gross Sales Previous Year/Quarter/Month
❏ Net Sales Previous Year/Quarter/Month
❏ Profit Previous Year/Quarter/Month
Dimensions
❏ Date & Time
❏ Division
❏ Category
❏ Product Group
❏ Store
❏ Brand
❏ Vendor
❏ Size/ Color
❏ Season/ Style/ Collection
❏ Custom Group
❏ GEO
❏ Gender
❏ Staff
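To make the metrics above concrete, here is a minimal Python sketch of how several of them could be derived from sale lines. The slides do not define the formulas, so the definitions used here (net sales = gross sales minus discounts, margin = profit / net sales, and so on) are common conventions taken as assumptions, and the `SaleLine` type and `retail_metrics` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class SaleLine:
    """One line of a sales transaction (hypothetical shape)."""
    transaction_id: str
    quantity: int
    unit_price: float  # list price per unit
    discount: float    # total discount applied to this line
    unit_cost: float   # cost of goods per unit


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def retail_metrics(lines: Iterable[SaleLine], store_area_m2: float) -> Dict[str, float]:
    """Compute a few retail metrics under the assumed definitions."""
    lines = list(lines)
    gross_sales = sum(l.quantity * l.unit_price for l in lines)
    net_sales = gross_sales - sum(l.discount for l in lines)
    cogs = sum(l.quantity * l.unit_cost for l in lines)
    profit = net_sales - cogs
    sold_qty = sum(l.quantity for l in lines)
    transactions = len({l.transaction_id for l in lines})
    return {
        "gross_sales": gross_sales,
        "net_sales": net_sales,
        "cogs": cogs,
        "profit": profit,
        "margin": _ratio(profit, net_sales),
        "sold_qty": sold_qty,
        "transactions": transactions,
        "sales_per_m2": _ratio(net_sales, store_area_m2),
        "avg_basket_value": _ratio(net_sales, transactions),
        "avg_item_value": _ratio(net_sales, sold_qty),
    }
```

In a warehouse these would normally be aggregations over a fact table grouped by the dimensions listed above; the function only shows the arithmetic.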
Design Data Model
Star Schema of Sales Data
Data Tables in Hive
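The star schema and Hive tables were shown as diagrams that did not survive in this transcript. As a rough illustration of the idea only, the sketch below builds a tiny star schema (one fact table joined to dimension tables) using Python's built-in `sqlite3` module as a stand-in for Hive. All table and column names (`fact_sales`, `dim_store`, `dim_product`, `dim_date`) are hypothetical and are not taken from the original slides.

```python
import sqlite3


def build_demo_warehouse() -> sqlite3.Connection:
    """Create an in-memory star schema with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT, brand TEXT);
        -- The fact table holds measures plus one foreign key per dimension.
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            store_key INTEGER REFERENCES dim_store(store_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            sold_qty INTEGER,
            net_sales REAL
        );
        """
    )
    conn.execute("INSERT INTO dim_date VALUES (20240101, 2024, 1)")
    conn.executemany(
        "INSERT INTO dim_store VALUES (?, ?)", [(1, "Store A"), (2, "Store B")]
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Shoes", "X"), (2, "Shirts", "Y")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
        [
            (20240101, 1, 1, 2, 100.0),
            (20240101, 1, 2, 1, 30.0),
            (20240101, 2, 1, 1, 50.0),
            (20240101, 1, 1, 1, 50.0),
        ],
    )
    return conn


def net_sales_by_store_and_category(conn: sqlite3.Connection):
    """Aggregate the fact table along two dimensions."""
    return conn.execute(
        """
        SELECT s.store_name, p.category, SUM(f.net_sales)
        FROM fact_sales AS f
        JOIN dim_store AS s ON s.store_key = f.store_key
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY s.store_name, p.category
        ORDER BY s.store_name, p.category
        """
    ).fetchall()
```

The same shape carries over to Hive: each metric becomes a column of the fact table, and each dimension becomes a lookup table joined by a key.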
Build Data Pipeline
❏ Collect : Data is extracted from
on-premises databases using Apache
Sqoop and loaded into Hadoop
HDFS.
❏ Storage : Data is stored in its original
form in HDFS. It serves as an
immutable staging area for the data
warehouse.
❏ Process/Analyze : Data is transformed
and loaded into the data warehouse using
Apache Hive. Apache Kylin then builds
OLAP cubes from the warehouse data
for the data marts.
❏ Consume : Data is consumed by
users through different BI tools and
Google Assistant Chatbot
❏ Orchestrate : Data processes are
orchestrated by Oozie workflows, which
are monitored in HUE.
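As an illustration of the Collect step, the sketch below assembles a Sqoop import command as an argument list in Python. The helper `sqoop_import_args` and every value passed to it (JDBC URL, paths, table and column names) are hypothetical. The flags are those of the Sqoop 1 `import` tool as I understand them, so check them against the Sqoop version you run. The optional incremental mode is one possible approach to incremental loading, not necessarily the one the authors used.

```python
from typing import List, Optional


def sqoop_import_args(
    jdbc_url: str,
    username: str,
    password_file: str,
    table: str,
    target_dir: str,
    check_column: Optional[str] = None,
    last_value: Optional[int] = None,
    num_mappers: int = 4,
) -> List[str]:
    """Build the argument list for a `sqoop import` run (assumed flags)."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]
    if check_column is not None:
        # Append-mode incremental import: only rows whose check column
        # exceeds the last imported value are extracted.
        args += [
            "--incremental", "append",
            "--check-column", check_column,
            "--last-value", str(last_value if last_value is not None else 0),
        ]
    return args
```

The list can be passed to `subprocess.run` on a host where Sqoop is installed, or rendered as a shell command for an Oozie action.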
Data Pipeline
Data Validation
❏ Design the validation job in Talend
Big Data Studio.
❏ Build the job as a JAR file.
❏ Run, schedule, and monitor the
validation job as a Java action in
HUE.
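The slides do not say which checks the Talend validation job performs. One common kind of check is a reconciliation of row counts and measure totals between the source and the warehouse, sketched below in plain Python rather than Talend. The function name `reconcile`, the row format (dictionaries), and the tolerance are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def reconcile(
    source_rows: Iterable[Dict],
    warehouse_rows: Iterable[Dict],
    key: str,
    measure: str,
    tolerance: float = 0.01,
) -> List[Tuple]:
    """Compare row counts and measure totals per key; return mismatches."""

    def summarize(rows: Iterable[Dict]) -> Dict:
        totals: Dict = {}
        for row in rows:
            count, total = totals.get(row[key], (0, 0.0))
            totals[row[key]] = (count + 1, total + row[measure])
        return totals

    src, dwh = summarize(source_rows), summarize(warehouse_rows)
    issues = []
    for k in sorted(set(src) | set(dwh)):
        s_count, s_total = src.get(k, (0, 0.0))
        d_count, d_total = dwh.get(k, (0, 0.0))
        if s_count != d_count:
            issues.append((k, "row_count", s_count, d_count))
        elif abs(s_total - d_total) > tolerance:
            issues.append((k, measure, s_total, d_total))
    return issues
```

An empty result means the two sides agree for every key. A scheduled job could fail, or send an alert, whenever the list is non-empty.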
System Monitoring
Problems
1. Performance Tuning
2. Incremental Loading
3. Data Validation
4. Cubing Optimization
Performance Tuning - Design Processing Queues
Performance Tuning - Using ORC File Format
❏ Efficient compression: Stored as columns and
compressed, which leads to smaller disk reads. The
columnar format is also ideal for vectorization
optimizations in Tez
❏ Fast reads: ORC has a built-in index, min/max
values, and other aggregates that cause entire
stripes to be skipped during reads. In addition,
predicate pushdown pushes filters into reads so that
minimal rows are read. And Bloom filters further
reduce the number of rows that are returned.
❏ Proven in large-scale deployments: Facebook uses
the ORC file format for a 300+ PB deployment.
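The stripe-skipping idea behind ORC's fast reads can be shown with a toy model. The sketch below is not the ORC reader and does not use any ORC library. It only mimics how per-stripe min/max statistics let a reader skip whole stripes that cannot satisfy a pushed-down filter. The `Stripe` class and `scan_greater_than` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Stripe:
    """A block of column values plus the min/max statistics kept with it."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of stripes actually read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # Filter pushed down to the stripe level: if even the maximum
        # fails the predicate, no row in the stripe can match.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read
```

With sorted or clustered data, most stripes fall entirely outside a typical filter range, which is where the savings in disk reads come from.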
Performance Tuning - Using ORC File Format
❏ Query time : ~1/3 (relative to before ORC)
❏ Data storage : ~1/4 (relative to before ORC)
THANK YOU
Alex Nguyen
CTO, Product Manager.
E: alex@magestore.com
P: +84 93 792 9396
twitter.com/alexmagestore
Tommy Nguyen
Data Engineering
E: tommy@trueplus.com
P: +84 33 422 8033
fb.com/vocungphuphang
