Build data warehouse for retail using Hadoop
1. Build Data Warehouse for Retail using Hadoop Ecosystem
Let Data tell you its stories
2. Virtual Assistant for Retail Business
What happened?
Why did it happen?
What will happen?
What should I do?
Store A sales decreased 15% vs. the same day last year.
Sales of a key department decreased 20% because key products were out of stock.
Suppliers can't deliver incoming orders of these key products this week. Sales will continue decreasing by 18% next week.
Find and purchase alternative products for the store. I've sent you a detailed email.
Awesome! Thank you.
4. What is a Data Warehouse?
❏ A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing.
❏ It includes historical data derived from transaction data from single or multiple sources.
❏ A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile store of information in support of management's decisions.
5. Data Warehouse Models
❏ Basic data warehouse architecture
❏ Data warehouse architecture with a staging area
❏ Data warehouse architecture with a staging area and data marts (the most common data warehouse architecture)
12. 6 Steps to Build a Data Warehouse
❏ Understand the problem we need to solve
❏ Define data sources and data marts
❏ Design data models for staging, data warehouse & OLAP
❏ Build the data pipeline
❏ Data validation
❏ System monitoring
13. Understand the Problem
Metrics
❏ Gross Sales
❏ Net Sales
❏ Profit
❏ COGS (Cost of Goods Sold)
❏ Margin
❏ Sold Qty
❏ Transactions
❏ Sales per Square Meter
❏ Avg Basket Value
❏ Avg Item Value
❏ Gross Sales vs. Previous Year/Quarter/Month
❏ Net Sales vs. Previous Year/Quarter/Month
❏ Profit vs. Previous Year/Quarter/Month
Dimensions
❏ Date & Time
❏ Division
❏ Category
❏ Product Group
❏ Store
❏ Brand
❏ Vendor
❏ Size / Color
❏ Season / Style / Collection
❏ Custom Group
❏ GEO
❏ Gender
❏ Staff
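The metrics above are typically derived from a transaction-level fact table. A minimal Python sketch of how a few of them relate to each other (the field names `txn_id`, `gross`, `discount`, and `cogs` are illustrative assumptions, not from the deck):

```python
# Compute a few of the retail metrics above from transaction-line records.
# Field names (txn_id, gross, discount, cogs) are illustrative assumptions.

lines = [
    {"txn_id": 1, "qty": 2, "gross": 20.0, "discount": 2.0, "cogs": 12.0},
    {"txn_id": 1, "qty": 1, "gross": 15.0, "discount": 0.0, "cogs": 9.0},
    {"txn_id": 2, "qty": 3, "gross": 30.0, "discount": 3.0, "cogs": 18.0},
]

gross_sales = sum(l["gross"] for l in lines)                 # 65.0
net_sales = sum(l["gross"] - l["discount"] for l in lines)   # 60.0
cogs = sum(l["cogs"] for l in lines)                         # 39.0
profit = net_sales - cogs                                    # 21.0
margin = profit / net_sales                                  # 0.35
sold_qty = sum(l["qty"] for l in lines)                      # 6
transactions = len({l["txn_id"] for l in lines})             # 2
avg_basket_value = net_sales / transactions                  # 30.0
avg_item_value = net_sales / sold_qty                        # 10.0
```

In the warehouse these aggregations would be grouped by the dimensions listed above (date, store, product group, etc.) rather than computed over all rows.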
17. Build Data Pipeline
❏ Collect: Data is extracted from on-premise databases using Apache Sqoop, then loaded into Hadoop HDFS.
❏ Storage: Data is stored in its original form in HDFS, which serves as an immutable staging area for the data warehouse.
❏ Process/Analyze: Data is transformed and loaded into the data warehouse using Apache Hive. Apache Kylin then builds OLAP cubes from the data warehouse into the data marts.
❏ Consume: Data is consumed by users through different BI tools and a Google Assistant chatbot.
❏ Orchestrate: Data processes are orchestrated as Oozie workflows and monitored in HUE.
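The Collect and Process steps above can be chained in a single Oozie workflow definition. A minimal sketch, assuming hypothetical connection strings, paths, and script names (none of these values come from the deck):

```xml
<workflow-app name="retail-dw-daily" xmlns="uri:oozie:workflow:0.5">
    <start to="sqoop-import"/>

    <!-- Collect: pull the source table into the HDFS staging area -->
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://src-db/retail --table sales --target-dir /staging/sales/${date}</command>
        </sqoop>
        <ok to="hive-transform"/>
        <error to="fail"/>
    </action>

    <!-- Process: transform staged data and load it into the warehouse -->
    <action name="hive-transform">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load_dw.hql</script>
            <param>date=${date}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

A Kylin cube build could be appended as a further action (e.g. via Kylin's REST API from a shell action), and the whole workflow scheduled by an Oozie coordinator and monitored in HUE as described above.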
19. Data Validation
❏ Design validation jobs in Talend Big Data Studio.
❏ Build each job as a JAR file.
❏ Run, schedule, and monitor validation jobs as Java actions in HUE.
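A common validation in this kind of pipeline is reconciling per-table row counts between the source databases and the warehouse. A minimal Python sketch of the check itself; in a real job the counts would come from JDBC and HiveServer2 queries, but here they are stubbed so the logic is runnable:

```python
# Reconcile source vs. warehouse row counts and report mismatches.
# The count values below are stubs standing in for real JDBC/Hive queries.

source_counts = {"sales": 10_000, "products": 1_200, "stores": 45}
warehouse_counts = {"sales": 10_000, "products": 1_199, "stores": 45}

def validate(source, warehouse):
    """Return a list of (table, source_count, warehouse_count) mismatches."""
    mismatches = []
    for table, src in source.items():
        dw = warehouse.get(table, 0)
        if src != dw:
            mismatches.append((table, src, dw))
    return mismatches

errors = validate(source_counts, warehouse_counts)
# errors == [("products", 1200, 1199)]
```

The same pattern extends to checksums or column sums (e.g. total net sales per day) when row counts alone are not strong enough.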
24. Performance Tuning - Using ORC File Format
❏ Efficient compression: Data is stored as columns and compressed, which leads to smaller disk reads. The columnar format is also ideal for vectorized query execution in Tez.
❏ Fast reads: ORC has built-in indexes, min/max values, and other aggregates that allow entire stripes to be skipped during reads. In addition, predicate pushdown pushes filters into reads so that minimal rows are read, and Bloom filters further reduce the number of rows returned.
❏ Proven in large-scale deployments: Facebook uses
the ORC file format for a 300+ PB deployment.
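In Hive, adopting ORC is a DDL-level change. A sketch of what that might look like; the table and column names, and the Snappy/bloom-filter property choices, are illustrative assumptions, not taken from the deck:

```sql
-- Fact table stored as ORC, with compression and a bloom filter on a
-- high-cardinality filter column; all names here are illustrative.
CREATE TABLE fact_sales (
    date_key     INT,
    store_key    INT,
    product_key  INT,
    sold_qty     INT,
    net_sales    DECIMAL(18,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC
TBLPROPERTIES (
    'orc.compress' = 'SNAPPY',
    'orc.bloom.filter.columns' = 'product_key'
);

-- Existing text-format staging data is rewritten into ORC with a plain insert:
INSERT OVERWRITE TABLE fact_sales PARTITION (sale_date)
SELECT date_key, store_key, product_key, sold_qty, net_sales, sale_date
FROM staging_sales;
```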
25. Performance Tuning - Using ORC File Format
❏ Query time: reduced to ~1/3
❏ Data storage: reduced to ~1/4
26. THANK YOU
Alex Nguyen
CTO, Product Manager.
E: alex@magestore.com
P: +84 93 792 9396
twitter.com/alexmagestore
Tommy Nguyen
Data Engineer
E: tommy@trueplus.com
P: +84 33 422 8033
fb.com/vocungphuphang