Data Storage Solutions for Data Analytics
B9DA102 Data Storage Solutions for Data Analytics
DATA STORAGE
ASSIGNMENT
Submitted by Trushita Prashant Redij
Student No: 10504099
INDEX
Table of Contents
1. Introduction to Data Warehouse
1.1 Business Intelligence
1.2 Data Warehouse
2. Iowa Liquor Sales Analysis
2.1 Why Iowa Liquor Sales Analysis
2.2 Data Warehouse Vision and Goals
2.3 Source Data Table Description
2.4 Area of Analysis
2.5 Key Stakeholders
2.6 Tools Used
2.7 Data Warehouse Architecture
3. Data Model of Data Warehouse
3.1 Designing Data Warehouse
4. Data Warehouse Implementation
4.1 Extract, Transform and Load
5. SSRS Reports and Visualization
5.1 SSRS Report
5.2 Tableau Visualization
6. XML, XSD and DTD
7. Neo4J and Cypher Queries
7.1 Relation between Nodes
7.2 Relation between Multiple Nodes
8. Bibliography
1. Introduction
In recent years, a sharp rise in the business scenarios that affect today’s marketplace
has increased the need for data analysis and data warehousing in every organization.
Data generated by an organization is extracted, transformed and analysed to derive
useful insights, which in turn influence the organization’s business decisions.
1.1 BUSINESS INTELLIGENCE:
Business Intelligence (BI) is the act of transforming raw data into useful information for
business analysis. BI based on Data Warehouse technology extracts information from a
company’s operational systems. The data is transformed (cleaned and integrated) and loaded
into Data Warehouses. Because the loaded data has been cleaned and integrated, it is
credible enough to be used for business insights.
1.2 DATA WAREHOUSE:
Data collected from various sources and stored in various databases cannot be directly
visualized. The data first needs to be integrated and processed before visualization takes
place. A data warehouse is a central location where consolidated data from multiple
locations (databases) is stored. It is maintained separately from an organization’s
operational database.
[Figure: several data sources consolidated into information]
2. Iowa Liquor Sales Analysis
2.1 Why Iowa Liquor Sales?
The Iowa Liquor Sales data set was released by the Iowa state government’s Department of
Commerce. It records the product details and date of purchase of spirits bought by holders
of class “E” liquor licenses. The data set is 3.2 GB in size, so it can be used effectively
for data modelling.
There are 55,564 rows, each capturing a purchase order. Each observation has 18 features
describing the product, vendor, invoice and store.
2.2 Data Warehouse Vision and Goals
Vision
Utilize the accessible data and provide insights on Liquor sales in different cities, most
popular brands and types of liquor sold, price distribution and variance across different
stores.
Goals
Study trends in the data and provide answers to strategic questions.
Provide subject-oriented analysis, wherein data is categorized and stored by
business subject rather than by application.
Integrate data from disparate sources and store it in a single place.
Ensure the data is non-volatile and time-variant.
2.3 Source Data Tables Description
Invoice_Source: Contains details about the purchase order.
Category_Table: Lists the liquor brands.
Vendor_Table: Summarizes vendor details, including the vendor number.
Store_Table: Describes store details such as address, city, zip code and store number.
Item_Table: Contains details about liquor products, such as item number and ID.
2.4 Area of Analysis
As liquor brands and vendors are the key stakeholders of the liquor business, it is
important for a store to assess its liquor sales. We can analyse and assess the liquor
sales according to the geographical locations of the stores, the liquor brands purchased,
the amount of sales generated in dollars, the volume of liquor sold, and the availability
of vendors to supply liquor bottles.
2.5 Key Stakeholders
Vendor: The team that supplies liquor of different brands to the stores.
Store keeper: Updates records and monitors the stock of liquor bottles in the store.
Customer: Purchases liquor from the store.
2.6 Tools Used
SQL Server Management Studio
SQL Server Integration Services
SQL Server Reporting Services
R Studio
Tableau
2.7 Data Warehouse Architecture
Approach used: Kimball’s bottom-up approach
Advantages:
Data is easily available and performance is high, since the source is an operational
system.
Dimensional modelling is used, wherein data from multiple sources is combined,
cleaned and normalized.
Queries execute faster.
Use of a star schema helps in building an agile, decentralized warehouse.
Data modelling with business intelligence tools helps users to access data easily.
Steps used in dimensional modelling:
Selecting Business process
Declaring Grain
Identifying Dimensions
Designing Fact Table
Grain:
The number of liquor bottles sold by a store or purchased from a vendor, which is
recorded in the fact table as the measurable unit. There are five dimension tables
corresponding to the measurable units in the fact table.
STAGING AREA: Data is fetched from the different data sources and cleaned: duplicate
rows are removed and null values are either removed or fixed.
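The cleaning step in the staging area can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the `clean_rows` helper and the `"UNKNOWN"` default are all hypothetical and are not taken from the assignment, which performed this step with SSIS.

```python
# Made-up raw rows, standing in for records fetched from a source table.
raw_rows = [
    {"STORE_NO": 1001, "STORE_NAME": "Store A", "STORE_CITY": "City X"},
    {"STORE_NO": 1001, "STORE_NAME": "Store A", "STORE_CITY": "City X"},  # exact duplicate
    {"STORE_NO": 1002, "STORE_NAME": "Store B", "STORE_CITY": None},      # missing city
    {"STORE_NO": None, "STORE_NAME": "Store C", "STORE_CITY": "City Y"},  # missing key
]


def clean_rows(rows, key, defaults):
    """Drop rows with a missing key, fill other nulls from defaults, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if row[key] is None:
            continue  # a row without its business key cannot be repaired
        fixed = {col: (defaults.get(col) if val is None else val)
                 for col, val in row.items()}
        fingerprint = tuple(sorted(fixed.items()))
        if fingerprint in seen:
            continue  # same content already staged
        seen.add(fingerprint)
        cleaned.append(fixed)
    return cleaned


staged = clean_rows(raw_rows, key="STORE_NO", defaults={"STORE_CITY": "UNKNOWN"})
for row in staged:
    print(row)
```

Whether a null value is dropped or replaced by a default is a design decision for each column; the sketch drops rows only when the key itself is missing.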
[Figure: Iowa Liquor Sales data warehouse architecture. The source tables (Invoice, Category, Store, Vendor, Item) pass through ETL into the staging area (loading and cleaning of data); dimensional modelling then builds the data warehouse (star schema/cube), which serves business queries and data visualization using Tableau.]
3. DATA MODEL of DATA WAREHOUSE
STAR SCHEMA: A schema is a logical description of the entire database. It gives details of
the constraints placed on the tables, the values present, and how the key values are linked
between the different tables.
The star schema plays a fundamental role in a data warehouse. It contains a fact table and
dimension tables. The fact table contains all measurable values and foreign keys.
Dimension Table: Primary keys in the dimension tables are designated as foreign keys in the
fact table. Dimension tables describe the details of the business processes.
Advantages of Star Schema:
It is a steady, recoverable and reinforceable approach to building an OLAP cube.
Extraction, transformation and loading can be performed swiftly.
It has a minimal number of foreign keys and low complexity.
[Figure: Star schema for Iowa Liquor Sales]
3.1 Designing the Data warehouse:
DIMENSION TABLES
ITEM_DIM:
This dimension contains the product ID as primary key, the product number and
the product description. All the measurable values, such as items sold, volume
sold and number of bottles sold, are held in the fact table.
CATEGORY_DIM:
This dimension consists of the brand category ID as primary key, the category
number and the category name.
INVOICE_DIM:
This dimension contains the invoice ID, invoice item number, invoice date,
bottles sold, sales in dollars, volume sold in litres and volume sold in gallons.
VENDOR_DIM:
This dimension consists of the vendor ID (primary key), the vendor number and
the vendor name.
STORE_DIM:
This dimension contains the store ID as primary key, the store number and the
store name.
The dimension tables are interlinked, and all the primary keys of the dimension
tables are declared as foreign keys in the fact table.
FACT TABLE
LIQUOR_FACT:
It comprises records related to Invoice, Vendor, Item, Category and
Store that are used to process business queries.
It contains two types of attributes: foreign keys and measures.
4. Data warehouse Implementation
Create STORE_DIM Table:
CREATE TABLE STORE_DIM
(
STORE_ID INT PRIMARY KEY NOT NULL,
STORE_NO INT,
STORE_NAME NVARCHAR(50),
ADDRESS NVARCHAR(50),
STORE_CITY NVARCHAR(50),
STORE_COUNTY NVARCHAR(50),
STORE_COUNTY_NO INT,
STORE_ZIP NVARCHAR(50)
)
Create CATEGORY_DIM Table:
CREATE TABLE CATEGORY_DIM
(
CATEGORY_ID INT PRIMARY KEY NOT NULL,
CATEGORY NVARCHAR(50),
CATEGORY_NAME NVARCHAR(50)
)
Create VENDOR_DIM Table:
CREATE TABLE VENDOR_DIM
(
VENDOR_ID INT PRIMARY KEY NOT NULL,
VENDOR_NO INT,
VENDOR_NAME NVARCHAR(50)
)
Create ITEM_DIM Table:
CREATE TABLE ITEM_DIM
(
ITEM_ID INT PRIMARY KEY NOT NULL,
ITEM_NO INT ,
ITEM_DESCRIPTION NVARCHAR(100),
CATEGORY_ID INT NOT NULL,
VENDOR_ID INT NOT NULL
)
Create INVOICE_DIM Table:
CREATE TABLE INVOICE_DIM
(
INVOICE_ID INT PRIMARY KEY NOT NULL,
ITEM_ID INT NOT NULL,
STORE_ID INT NOT NULL,
INVOICE_ITEM_NO NVARCHAR(50),
INVOICE_DATE NVARCHAR(50) ,
BOTTLE_SOLD INT,
SALES_DOLLARS NVARCHAR(50),
VOL_SOLD_LITRES FLOAT,
VOL_SOLD_GALLONS FLOAT
)
Create LIQUOR_FACT Table:
CREATE TABLE LIQUOR_FACT
(
INVOICE_ID INT NOT NULL,
STORE_ID INT NOT NULL ,
VENDOR_ID INT NOT NULL,
ITEM_ID INT NOT NULL,
CATEGORY_ID INT NOT NULL,
PACK_SIZE INT NULL, -- not supplied by the fact-load INSERT, so it must accept NULL
TOTAL_BOTTLE_SOLD INT NOT NULL,
TOTAL_SALES NVARCHAR(50) NOT NULL ,
TOTAL_VOL_SOLD_LITRES NVARCHAR(50) NOT NULL
)
USE lowa_liquor_sales_star;
ALTER TABLE LIQUOR_FACT ADD CONSTRAINT FK_STORE_ID
    FOREIGN KEY (STORE_ID) REFERENCES STORE_DIM (STORE_ID);
ALTER TABLE LIQUOR_FACT ADD CONSTRAINT FK_VENDOR_ID
    FOREIGN KEY (VENDOR_ID) REFERENCES VENDOR_DIM (VENDOR_ID);
ALTER TABLE LIQUOR_FACT ADD CONSTRAINT FK_INVOICE_ID
    FOREIGN KEY (INVOICE_ID) REFERENCES INVOICE_DIM (INVOICE_ID);
ALTER TABLE LIQUOR_FACT ADD CONSTRAINT FK_ITEM_ID
    FOREIGN KEY (ITEM_ID) REFERENCES ITEM_DIM (ITEM_ID);
ALTER TABLE LIQUOR_FACT ADD CONSTRAINT FK_CATEGORY_ID
    FOREIGN KEY (CATEGORY_ID) REFERENCES CATEGORY_DIM (CATEGORY_ID);
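What a foreign key constraint between the fact table and a dimension table buys can be shown with a small, self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in for SQL Server, which is an assumption of this example and not the tool used in the assignment; the tables are cut down to two columns each and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE STORE_DIM (
    STORE_ID   INTEGER PRIMARY KEY,
    STORE_NAME TEXT
);
CREATE TABLE LIQUOR_FACT (
    STORE_ID          INTEGER NOT NULL REFERENCES STORE_DIM (STORE_ID),
    TOTAL_BOTTLE_SOLD INTEGER NOT NULL
);
INSERT INTO STORE_DIM VALUES (1, 'Example Store');
""")

# A fact row that points at an existing dimension row is accepted.
conn.execute("INSERT INTO LIQUOR_FACT VALUES (1, 12)")

# A fact row that points at a missing dimension row is rejected.
try:
    conn.execute("INSERT INTO LIQUOR_FACT VALUES (99, 5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM LIQUOR_FACT").fetchone()[0]
print(rejected, fact_rows)
```

The constraint guarantees that every fact row can be joined back to a dimension row, which is what the reports and visualizations later rely on.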
4.1 Extract, Transform and Load
Data warehouse construction involves extracting data from different sources,
transforming it and loading it into the data warehouse.
SSIS and SSMS are used to perform the ETL tasks.
Extraction:
Data has been extracted from the source database, which contains five datasets, and
loaded into the corresponding dimensions.
Source Database:
Iowa_Liquor_sales: This is the source database containing all the raw data of
Iowa liquor sales, which is loaded into five tables.
The extracted data has been cleaned and restructured into a structured dataset,
which serves as the staging area prior to transformation and loading.
Transformation and Loading
In this process the data is converted into meaningful information.
All the dimension and fact tables are populated with relevant data from the staging
tables using SSIS.
The fact table is populated from the dimension tables. The primary key of each
dimension table serves as a foreign key in the fact table, which is what makes the
design a star schema.
[Figures: STORE_DIM and CATEGORY_DIM]
POPULATING FACT TABLE USING SQL QUERY
USE [lowa_liquor_sales_star]
GO
INSERT INTO [dbo].[LIQUOR_FACT]
    ([INVOICE_ID]
    ,[STORE_ID]
    ,[VENDOR_ID]
    ,[ITEM_ID]
    ,[CATEGORY_ID]
    ,[TOTAL_BOTTLE_SOLD]
    ,[TOTAL_SALES]
    ,[TOTAL_VOL_SOLD_LITRES])
SELECT
    i.INVOICE_ID, i.STORE_ID, v.VENDOR_ID, it.ITEM_ID, c.CATEGORY_ID,
    i.BOTTLE_SOLD, i.SALES_DOLLARS, i.VOL_SOLD_LITRES
FROM dbo.INVOICE_DIM AS i
JOIN dbo.ITEM_DIM AS it ON i.ITEM_ID = it.ITEM_ID
JOIN dbo.STORE_DIM AS s ON i.STORE_ID = s.STORE_ID
JOIN dbo.VENDOR_DIM AS v ON it.VENDOR_ID = v.VENDOR_ID
JOIN dbo.CATEGORY_DIM AS c ON it.CATEGORY_ID = c.CATEGORY_ID
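The behaviour of this load can be tried out on a toy dataset. The sketch below is an assumption-laden illustration: it uses Python's built-in `sqlite3` module in place of SQL Server, trims the tables to the columns the query needs, stores the measures as numbers, and uses made-up sample rows that do not come from the Iowa data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CATEGORY_DIM (CATEGORY_ID INTEGER PRIMARY KEY, CATEGORY_NAME TEXT);
CREATE TABLE VENDOR_DIM   (VENDOR_ID INTEGER PRIMARY KEY, VENDOR_NAME TEXT);
CREATE TABLE STORE_DIM    (STORE_ID INTEGER PRIMARY KEY, STORE_NAME TEXT);
CREATE TABLE ITEM_DIM     (ITEM_ID INTEGER PRIMARY KEY, ITEM_DESCRIPTION TEXT,
                           CATEGORY_ID INTEGER NOT NULL, VENDOR_ID INTEGER NOT NULL);
CREATE TABLE INVOICE_DIM  (INVOICE_ID INTEGER PRIMARY KEY, ITEM_ID INTEGER NOT NULL,
                           STORE_ID INTEGER NOT NULL, BOTTLE_SOLD INTEGER,
                           SALES_DOLLARS REAL, VOL_SOLD_LITRES REAL);
CREATE TABLE LIQUOR_FACT  (INVOICE_ID INTEGER, STORE_ID INTEGER, VENDOR_ID INTEGER,
                           ITEM_ID INTEGER, CATEGORY_ID INTEGER,
                           TOTAL_BOTTLE_SOLD INTEGER, TOTAL_SALES REAL,
                           TOTAL_VOL_SOLD_LITRES REAL);

-- Made-up sample rows, not taken from the Iowa dataset.
INSERT INTO CATEGORY_DIM VALUES (10, 'Sample Category');
INSERT INTO VENDOR_DIM   VALUES (20, 'Sample Vendor');
INSERT INTO STORE_DIM    VALUES (30, 'Sample Store');
INSERT INTO ITEM_DIM     VALUES (40, 'Sample Item', 10, 20);
INSERT INTO INVOICE_DIM  VALUES (1, 40, 30, 12, 150.0, 9.0);
INSERT INTO INVOICE_DIM  VALUES (2, 40, 30, 6, 75.0, 4.5);
INSERT INTO INVOICE_DIM  VALUES (3, 41, 30, 1, 10.0, 0.75);  -- item 41 has no dimension row

INSERT INTO LIQUOR_FACT
SELECT i.INVOICE_ID, i.STORE_ID, v.VENDOR_ID, it.ITEM_ID, c.CATEGORY_ID,
       i.BOTTLE_SOLD, i.SALES_DOLLARS, i.VOL_SOLD_LITRES
FROM INVOICE_DIM AS i
JOIN ITEM_DIM     AS it ON i.ITEM_ID = it.ITEM_ID
JOIN STORE_DIM    AS s  ON i.STORE_ID = s.STORE_ID
JOIN VENDOR_DIM   AS v  ON it.VENDOR_ID = v.VENDOR_ID
JOIN CATEGORY_DIM AS c  ON it.CATEGORY_ID = c.CATEGORY_ID;
""")

facts = conn.execute(
    "SELECT INVOICE_ID, VENDOR_ID, CATEGORY_ID, TOTAL_BOTTLE_SOLD "
    "FROM LIQUOR_FACT ORDER BY INVOICE_ID"
).fetchall()
print(facts)
```

Because the joins are inner joins, an invoice whose item, store, vendor or category is missing from its dimension table is silently left out of the fact table, as happens to invoice 3 here. Comparing the row counts of INVOICE_DIM and LIQUOR_FACT after the load is a simple way to detect such losses.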
5. SSRS REPORTS and VISUALIZATION
5.1 SSRS Report
Analysis: This report shows, for each invoice ID, the liquor store in which the invoice
was generated and the date on which it was generated.
Query:
SELECT a.INVOICE_ID, b.STORE_NAME, a.INVOICE_DATE, b.STORE_CITY
FROM INVOICE_DIM a
JOIN STORE_DIM b ON a.STORE_ID = b.STORE_ID
ORDER BY b.STORE_CITY
Analysis: This report lists each product together with the vendor that supplies it,
which shows where a product is available. It comprises item ID, vendor ID, item
description and vendor name.
Query:
-- ITEM_DIM is linked to VENDOR_DIM through VENDOR_ID (VENDOR_DIM has no ITEM_ID column)
SELECT a.ITEM_ID, b.VENDOR_ID, a.ITEM_DESCRIPTION, b.VENDOR_NAME
FROM ITEM_DIM a
JOIN VENDOR_DIM b ON a.VENDOR_ID = b.VENDOR_ID
ORDER BY b.VENDOR_NAME
Analysis:
The report shows the details of each invoice generated: invoice ID, bottles sold,
volume in litres and sales in dollars.
5.2 TABLEAU VISUALIZATION:
CASE STUDY OF LIQUOR VENDOR
The packed bubble chart illustrates the vendors with the largest
business in the state of Iowa.
The chart shows that Diageo Americas has the highest sales in
Iowa.
Jim Beam Brands, Bacardi USA, Inc., Luxco-St Louis, Sazerac
Co., Inc., and Laird and Company are the strong competitors in the liquor
business.
It gives vendors considerable input for developing new
strategies to grow their business in the liquor industry.
It can also be used to view the top ten vendors in the liquor industry in
the state of Iowa.
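A vendor ranking of this kind comes down to a grouped sum over the fact table. The following is a minimal sketch, assuming Python's built-in `sqlite3` module in place of SQL Server and Tableau, with made-up vendors and sales figures; `TOP_N` is a hypothetical parameter that would be 10 for a top-ten view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VENDOR_DIM (VENDOR_ID INTEGER PRIMARY KEY, VENDOR_NAME TEXT);
CREATE TABLE LIQUOR_FACT (VENDOR_ID INTEGER NOT NULL, TOTAL_SALES REAL NOT NULL);

-- Made-up vendors and sales figures, for illustration only.
INSERT INTO VENDOR_DIM VALUES (1, 'Vendor A'), (2, 'Vendor B'), (3, 'Vendor C');
INSERT INTO LIQUOR_FACT VALUES (1, 100.0), (1, 250.0), (2, 500.0), (3, 40.0), (3, 10.0);
""")

TOP_N = 2  # a top-ten view would use 10
ranking = conn.execute(
    """
    SELECT v.VENDOR_NAME, SUM(f.TOTAL_SALES) AS SALES
    FROM LIQUOR_FACT AS f
    JOIN VENDOR_DIM AS v ON f.VENDOR_ID = v.VENDOR_ID
    GROUP BY v.VENDOR_NAME
    ORDER BY SALES DESC
    LIMIT ?
    """,
    (TOP_N,),
).fetchall()
print(ranking)
```

The sketch stores TOTAL_SALES as a number. In the warehouse defined earlier the column is declared as NVARCHAR(50), so in SQL Server it would have to be cast to a numeric type before it could be summed.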
CASE STUDY OF BOTTLES SOLD ACCORDING TO LIQUOR BRANDS
The pie chart illustrates the number of bottles sold for each
liquor brand.
The chart clearly shows that Vodka 80 is the highest-selling liquor brand
in the state of Iowa.
Blended Whiskies, Spiced Rum and Canadian Whiskies are the prominent
competitors among the liquor brands.
It also gives a brief overview of customers’ preferences for liquor
brands.
This will help vendors and stores to plan their
inventory in order to increase their sales in the liquor industry.
BUSINESS CASE STUDY OF VENDOR AND SALES
The horizontal bar graph illustrates the liquor sales for each
liquor vendor.
It highlights the liquor brand that sells the highest number of bottles for
a particular vendor.
For example, Tennessee Whiskies is the highest-selling liquor brand of
Brown-Forman Corporation.
The illustration helps vendors to plan their business and
achieve profits by selling the brands that are in high demand.
CASE STUDY OF SALES vs CATEGORY of LIQUOR BOTTLES
The bar graph shows the sales of liquor bottles for each
brand name.
It gives details of the bottle cost and the sales in dollars against the
category of the liquor brand.
The graph gives the vendor a detailed view of bottle sales
by brand.
6. eXtensible Markup Language, XSD and DTD
XML :
XML is the abbreviation of eXtensible Markup Language.
It is used to store data in text format, which helps in storing, sharing and
accessing data on different platforms.
It is widely used in web development and is complementary to HTML.
XSD:
XSD (XML Schema Definition) is a formal description of the structure of an XML document.
Programmers can use it to verify documents.
DTD:
DTD stands for Document Type Definition.
It defines the structure, legal elements and attributes of an XML document.
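To make the idea of storing data as XML concrete, the sketch below writes one record as XML text and reads it back using Python's standard `xml.etree.ElementTree` module. The element names (`categories`, `category`, `name`) and the sample values are hypothetical and are not taken from the assignment's XML files. The standard library parser only checks that the document is well formed; it does not validate against an XSD or a DTD, which needs a separate validating tool.

```python
import xml.etree.ElementTree as ET

# Build a small XML document for one made-up category record.
root = ET.Element("categories")
category = ET.SubElement(root, "category", attrib={"id": "10"})
ET.SubElement(category, "name").text = "Sample Category"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Parse the text back and read the values out again.
parsed = ET.fromstring(xml_text)
rows = [(c.get("id"), c.findtext("name")) for c in parsed.findall("category")]
print(rows)
```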
7. Neo4J and Cypher Queries
Step 1: Load all the .csv files onto Neo4j workspace and Index the files
Step 2: Create the relationship between the nodes
Upload Category Dimension File
LOAD CSV WITH HEADERS FROM "file:///Category_dim.csv" AS row
CREATE (CT:Category)
SET CT = row{CATEGORY_ID: row.CATEGORY_ID, CATEGORY_NAME: row.CATEGORY_Name}
RETURN CT
Upload Item Dimension File
LOAD CSV WITH HEADERS FROM "file:///Item_dim.csv" AS row
CREATE (IT:Item)
SET IT = row{ITEM_ID: row.ITEM_ID, ITEM_NO: row.ITEM_NO,
    ITEM_DESCRIPTION: row.ITEM_DESCRIPTION, CATEGORY_ID: row.CATEGORY_ID,
    VENDOR_ID: row.VENDOR_ID}
RETURN IT
Upload Invoice Dimension File:
LOAD CSV WITH HEADERS FROM "file:///Invoice_dim.csv" AS row
CREATE (INV:Invoice)
SET INV = row{INVOICE_ID: row.INVOICE_ID, ITEM_ID: row.ITEM_ID, STORE_ID: row.STORE_ID,
    INVOICE_ITEM_NO: row.INVOICE_ITEM_NO, INVOICE_DATE: row.INVOICE_DATE,
    BOTTLE_SOLD: row.BOTTLE_SOLD, SALES_DOLLARS: row.SALES_DOLLARS,
    VOL_SOLD_LITRES: row.VOL_SOLD_LITRES, VOL_SOLD_GALLONS: row.VOL_SOLD_GALLONS}
RETURN INV
Upload Liquor Fact File:
LOAD CSV WITH HEADERS FROM "file:///Liquor_fact_dim.csv" AS row
CREATE (FA:LiquorSale)
SET FA = row{TOTAL_SALES: row.TOTAL_SALES, CATEGORY_ID: row.CATEGORY_ID,
    INVOICE_ID: row.INVOICE_ID, ITEM_ID: row.ITEM_ID}
RETURN FA
7.1 RELATION BETWEEN NODES
Relationship between category and Liquor fact table
MATCH t=()-[r:has_CATEGORY_ID]->() RETURN t LIMIT 25
Relationship between invoice and Liquor fact table
MATCH t=()-[r:has_INVOICE_ID]->() RETURN t LIMIT 25
7.2 Relation between all multiple nodes:
Explanation:
The diagram illustrates the cardinality of the relationships between
multiple nodes.
The Neo4j graph depicts the relationships between the
Category, Invoice and Liquor fact tables.
Each circle corresponds to a node and each line corresponds to a
relationship.
The Category node is shown in orange, the Invoice node
in blue and the Liquor fact node in
red.
Graph Database and Relational Database:
Graph databases are indexed and consist of nodes and
relationships.
They consume more energy than relational databases.
Captured data can be changed with ease in a graph database.
Traversal is used to join relations in a graph database.
Code for relationship between multiple nodes
MATCH (CT:Category), (FA:LiquorSale)
WHERE CT.CATEGORY_ID = FA.CATEGORY_ID
CREATE (FA)-[r:has_CATEGORY_ID]->(CT)
RETURN CT, FA, r
MATCH (INV:Invoice), (FA:LiquorSale)
WHERE INV.INVOICE_ID = FA.INVOICE_ID
CREATE (FA)-[r:has_INVOICE_ID]->(INV)
RETURN INV, FA, r
MATCH (FA:LiquorSale), (CT:Category), (INV:Invoice)
WHERE CT.CATEGORY_ID = FA.CATEGORY_ID
AND INV.INVOICE_ID = FA.INVOICE_ID
RETURN CT, FA, INV
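As a rough illustration of what the key matching achieves, the sketch below mimics it in plain Python, with no Neo4j involved: nodes are dictionaries, a relationship is a `(source, type, target)` tuple, and the `link` helper is hypothetical. The sample values are made up, and they are strings because LOAD CSV reads every field as a string.

```python
# Made-up nodes, keyed the same way as the CSV columns used for the graph.
categories = [{"CATEGORY_ID": "10", "CATEGORY_NAME": "Sample Category"}]
invoices = [{"INVOICE_ID": "1"}, {"INVOICE_ID": "2"}]
sales = [
    {"INVOICE_ID": "1", "CATEGORY_ID": "10", "TOTAL_SALES": "150.0"},
    {"INVOICE_ID": "2", "CATEGORY_ID": "10", "TOTAL_SALES": "75.0"},
]


def link(sources, targets, key, rel_type):
    """Create one (source, rel_type, target) edge per pair whose key values match."""
    by_key = {}
    for target in targets:
        by_key.setdefault(target[key], []).append(target)
    return [(s, rel_type, t) for s in sources for t in by_key.get(s[key], [])]


edges = link(sales, categories, "CATEGORY_ID", "has_CATEGORY_ID")
edges += link(sales, invoices, "INVOICE_ID", "has_INVOICE_ID")

# Traversal: start at a category and follow has_CATEGORY_ID edges back to its sales.
sales_for_category = [s["INVOICE_ID"] for s, rel, t in edges
                      if rel == "has_CATEGORY_ID" and t["CATEGORY_ID"] == "10"]
print(len(edges), sales_for_category)
```

Once the relationships exist, related records are found by following edges from a starting node, which is the traversal that takes the place of a join in a relational query.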
8. Bibliography
Kimball, R. The Data Warehouse Toolkit. John Wiley, 1996.
XML and XSD validation. Available at:
http://www.utilitiesonline.info/xsdvalidation/#.XBqEuVz7TIV
Department of Commerce, Iowa State government. Iowa Liquor Sales. Available at:
https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy