Data Warehouses and
Multi-Dimensional
Data Analysis
Raimonds Simanovskis
@rsim
(Intro map slides: "Vampires live here", "500 km long beach", "Other vampires live here (310.686 miles)")
Sales app example
class Customer < ActiveRecord::Base
has_many :orders
end
class Order < ActiveRecord::Base
belongs_to :customer
has_many :order_items
end
class OrderItem < ActiveRecord::Base
belongs_to :order
belongs_to :product
end
class Product < ActiveRecord::Base
belongs_to :product_class
has_many :order_items
end
class ProductClass < ActiveRecord::Base
has_many :products
end
Database schema
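The schema diagram itself is not reproduced here. As a rough orientation, a minimal migration sketch of the transactional tables, with columns inferred from the models and queries in this deck (exact column types and the timestamps are assumptions):

class CreateSalesAppTables < ActiveRecord::Migration
  def change
    create_table :customers do |t|
      t.string :full_name, :city, :state_province, :country, :gender
      t.date   :birth_date
      t.timestamps
    end

    create_table :orders do |t|
      t.references :customer, index: true
      t.date :order_date
      t.timestamps
    end

    create_table :order_items do |t|
      t.references :order, :product, index: true
      t.integer :quantity
      t.decimal :amount, :cost, precision: 10, scale: 2   # summed in the report queries
      t.timestamps
    end

    create_table :products do |t|
      t.references :product_class, index: true
      t.string :brand_name, :product_name
      t.timestamps
    end

    create_table :product_classes do |t|
      t.string :product_family, :product_department, :product_category, :product_subcategory
      t.timestamps
    end
  end
end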
One day CEO asks
a question…
What were the total sales amounts in California in Q1 2014 by product families?
Total sales amount …
OrderItem.sum("amount")
… in California …
OrderItem.joins(:order => :customer).
where("customers.country" => "USA", "customers.state_province" => "CA").
sum("order_items.amount")
… in Q1 2014 …
OrderItem.joins(:order => :customer).
where("customers.country" => "USA", "customers.state_province" => "CA").
where("extract(year from orders.order_date) = ?", 2014).
where("extract(quarter from orders.order_date) = ?", 1).
sum("order_items.amount")
… by product families
OrderItem.joins(:order => :customer).
where("customers.country" => "USA", "customers.state_province" => "CA").
where("extract(year from orders.order_date) = ?", 2014).
where("extract(quarter from orders.order_date) = ?", 1).
joins(:product => :product_class).
group("product_classes.product_family").
sum("order_items.amount")
Generated SQL
OrderItem.joins(:order => :customer).
where("customers.country" => "USA", "customers.state_province" => "CA").
where("extract(year from orders.order_date) = ?", 2014).
where("extract(quarter from orders.order_date) = ?", 1).
joins(:product => :product_class).
group("product_classes.product_family").
sum("order_items.amount")
SELECT SUM(order_items.amount) AS sum_order_items_amount,
product_classes.product_family AS product_classes_product_family
FROM "order_items"
INNER JOIN "orders" ON "orders"."id" = "order_items"."order_id"
INNER JOIN "customers" ON "customers"."id" = "orders"."customer_id"
INNER JOIN "products" ON "products"."id" = "order_items"."product_id"
INNER JOIN "product_classes" ON "product_classes"."id" = "products"."product_class_id"
WHERE "customers"."country" = 'USA'
AND "customers"."state_province" = 'CA'
AND (extract(YEAR FROM orders.order_date) = 2014)
AND (extract(quarter FROM orders.order_date) = 1)
GROUP BY product_classes.product_family
… and also sales cost?
OrderItem.joins(:order => :customer).
where("customers.country" => "USA", "customers.state_province" => "CA").
where("extract(year from orders.order_date) = ?", 2014).
where("extract(quarter from orders.order_date) = ?", 1).
joins(:product => :product_class).
group("product_classes.product_family").
select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+
"SUM(order_items.cost) AS sales_cost").
map{|i| i.attributes.compact}
… and unique customers count?
OrderItem.joins(:order => :customer).
where("customers.country" => "USA", "customers.state_province" => "CA").
where("extract(year from orders.order_date) = ?", 2014).
where("extract(quarter from orders.order_date) = ?", 1).
joins(:product => :product_class).
group("product_classes.product_family").
select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+
"SUM(order_items.cost) AS sales_cost,"+
"COUNT(DISTINCT customers.id) AS customers_count").
map{|i| i.attributes.compact}
Is it clear? #@%$^&
OrderItem.joins(:order => :customer).
where("customers.country" => "USA", "customers.state_province" => "CA").
where("extract(year from orders.order_date) = ?", 2014).
where("extract(quarter from orders.order_date) = ?", 1).
joins(:product => :product_class).
group("product_classes.product_family").
select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+
"SUM(order_items.cost) AS sales_cost,"+
"COUNT(DISTINCT customers.id) AS customers_count").
map{|i| i.attributes.compact}
Performance slows down
on larger data volumes
$ rails console
>> OrderItem.count
(677.0ms) SELECT COUNT(*) FROM "order_items"
=> 6218022
>> Order.count
(126.0ms) SELECT COUNT(*) FROM "orders"
=> 642362
>> OrderItem.joins(:order => :customer).
joins(:product => :product_class).
group("product_classes.product_family").
select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+
"SUM(order_items.cost) AS sales_cost,"+
"COUNT(DISTINCT customers.id) AS customers_count").
map{|i| i.attributes.compact}
OrderItem Load (25437.0ms) ...
6 million rows
25 seconds
You should use NoSQL!
Dimensional Modeling
Deliver data that’s
understandable to the business users
Deliver fast query performance
Dimensional Modeling
What were the total sales amounts (fact or measure) in California (Customer / Region dimension) in Q1 2014 (Time dimension) by product families (Product dimension)?
Data Warehouse
A “star schema” with fact and dimension tables; a “snowflake schema” further normalizes the dimension tables. (A sketch of the star-schema tables follows.)
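The star-schema diagram is not reproduced here. A minimal sketch of the warehouse tables it implies, with table and column names taken from the load scripts and the Mondrian schema later in the deck (the dwh schema name is real; column types and the PostgreSQL-style schema-qualified table names are assumptions):

class CreateDwhStarSchema < ActiveRecord::Migration
  def change
    # dimension tables
    create_table "dwh.d_customers" do |t|
      t.string :full_name, :city, :state_province, :country, :gender
      t.date   :birth_date
      t.timestamps
    end

    create_table "dwh.d_product_classes" do |t|
      t.string :product_family, :product_department, :product_category, :product_subcategory
    end

    create_table "dwh.d_products" do |t|
      t.integer :product_class_id
      t.string  :brand_name, :product_name
    end

    create_table "dwh.d_time" do |t|
      t.date    :date_value
      t.integer :year, :quarter, :month, :day
      t.string  :year_name, :quarter_name, :month_name, :day_name
    end

    # fact table: one row per order item, foreign keys to all dimensions
    create_table "dwh.f_sales", id: false do |t|
      t.integer :customer_id, :product_id, :time_id
      t.integer :sales_quantity
      t.decimal :sales_amount, :sales_cost, precision: 10, scale: 2
    end
  end
end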
Data Warehouse Models
class Dwh::SalesFact < Dwh::Fact
belongs_to :customer, class_name: "Dwh::CustomerDimension"
belongs_to :product, class_name: "Dwh::ProductDimension"
belongs_to :time, class_name: "Dwh::TimeDimension"
end
class Dwh::CustomerDimension < Dwh::Dimension
has_many :sales_facts, class_name: "Dwh::SalesFact",
foreign_key: "customer_id"
end
class Dwh::ProductDimension < Dwh::Dimension
has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "product_id"
belongs_to :product_class, class_name: "Dwh::ProductClassDimension"
end
class Dwh::ProductClassDimension < Dwh::Dimension
has_many :products, class_name: "Dwh::ProductDimension", foreign_key: "product_class_id"
end
class Dwh::TimeDimension < Dwh::Dimension
has_many :sales_facts, class_name: "Dwh::SalesFact",
foreign_key: "time_id"
end
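The Dwh::Fact and Dwh::Dimension base classes are not shown in the deck; a minimal sketch of one plausible shape (an assumption), mapping each model to its warehouse table and providing the truncate! helper that the fact load script calls:

module Dwh
  class Fact < ActiveRecord::Base
    self.abstract_class = true

    # used by Dwh::SalesFact.load! below (the customer dimension defines its own truncate!)
    def self.truncate!
      connection.execute "TRUNCATE TABLE #{quoted_table_name}"
    end
  end

  class Dimension < ActiveRecord::Base
    self.abstract_class = true
  end
end

# Each concrete class would then set its warehouse table, e.g. (assumed) inside Dwh::SalesFact:
#   self.table_name = "dwh.f_sales"
# and "dwh.d_customers", "dwh.d_products", "dwh.d_product_classes", "dwh.d_time" in the dimensions.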
Load Dimension
class Dwh::CustomerDimension < Dwh::Dimension
# ...
def self.truncate!
connection.execute "TRUNCATE TABLE #{table_name}"
end
def self.load!
truncate!
column_names = %w(id full_name city state_province country
birth_date gender created_at updated_at)
connection.insert %[
INSERT INTO #{table_name} (#{column_names.join(',')})
SELECT #{column_names.join(',')}
FROM #{::Customer.table_name}
]
end
end
Generate
Time
Dimension
class Dwh::TimeDimension < Dwh::Dimension
def self.load!
connection.select_values(%[
SELECT DISTINCT order_date FROM #{Order.table_name}
WHERE order_date NOT IN
(SELECT date_value FROM #{table_name})
]).each do |date|
year, month, day = date.year, date.month, date.day
quarter = ((month-1)/3)+1
quarter_name = "Q#{quarter} #{year}"
month_name = date.strftime("%b %Y")
day_name = date.strftime("%b %d %Y")
sql = send :sanitize_sql_array, [
%[
INSERT INTO #{table_name}
(id, date_value, year, quarter, month, day,
year_name, quarter_name, month_name, day_name)
VALUES
(?, ?, ?, ?, ?, ?,
?, ?, ?, ?)
],
date_to_id(date), date, year, quarter, month, day,
year.to_s, quarter_name, month_name, day_name
]
connection.insert sql
end
end
end
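The date_to_id helper used above is not shown in the deck. It has to produce the same YYYYMMDD integer key that the fact loader below derives in SQL, so a matching sketch would presumably be:

class Dwh::TimeDimension < Dwh::Dimension
  # must match CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER) in Dwh::SalesFact.load!,
  # e.g. 2014-02-17 -> 20140217
  def self.date_to_id(date)
    date.strftime("%Y%m%d").to_i
  end
end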
Load Facts
class Dwh::SalesFact < Dwh::Fact
def self.load!
truncate!
connection.insert %[
INSERT INTO #{table_name}
(customer_id, product_id, time_id,
sales_quantity, sales_amount, sales_cost)
SELECT
o.customer_id, oi.product_id,
CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER),
oi.quantity, oi.amount, oi.cost
FROM
#{OrderItem.table_name} oi
INNER JOIN #{Order.table_name} o ON o.id = oi.order_id
]
end
end
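How these loads are orchestrated is not shown in the deck; one hedged sketch is a rake task that refreshes the dimensions first and the fact table last, so every fact row finds its dimension keys (task and file names are hypothetical):

# lib/tasks/dwh.rake (hypothetical)
namespace :dwh do
  desc "Reload the data warehouse: dimensions first, then facts"
  task load: :environment do
    Dwh::CustomerDimension.load!
    Dwh::TimeDimension.load!
    # product and product class dimensions would be loaded the same way
    Dwh::SalesFact.load!
  end
end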
What were the total sales amounts in California in Q1 2014 by product families?
Dwh::SalesFact.
joins(:customer).joins(:product => :product_class).joins(:time).
where("d_customers.country" => “USA",
"d_customers.state_province" => "CA").
where("d_time.year" => 2014, "d_time.quarter" => 1).
group("d_product_classes.product_family").
sum("sales_amount")
Two-Dimensional Table: rows and columns locate a cell.
Multi-Dimensional Data Model: several dimensions plus measures form a data cube.
Multi-Dimensional Data Model: the Sales cube has Time, Product, and Customer dimensions, with the measures Sales quantity, Sales amount, Sales cost, and Customers count.
Dimension Hierarchies
Customer dimension levels: All Customers → Country (USA, Canada) → State (WA, CA, OR) → City (San Francisco, Los Angeles)
Time Dimension
Default hierarchy: All Times → Year (2014, 2015) → Quarter (Q1, Q2, Q3, Q4) → Month (JUL, AUG, SEP) → Day (AUG 01, AUG 02)
Weekly hierarchy: All Times → Year (2014, 2015) → Week (W1, W2, W3, W4) → Day (JAN 17, JAN 18, JAN 19)
OLAP Technologies
On-Line Analytical Processing
Mondrian
http://community.pentaho.com/projects/mondrian/
https://github.com/rsim/mondrian-olap
mondrian-olap gem
Mondrian::OLAP::Schema.define do
cube 'Sales' do
table 'f_sales', schema: 'dwh'
dimension 'Customer', foreign_key: 'customer_id' do
hierarchy all_member_name: 'All Customers', primary_key: 'id' do
table 'd_customers', schema: 'dwh'
level 'Country', column: 'country'
level 'State Province', column: 'state_province'
level 'City', column: 'city'
level 'Name', column: 'full_name'
end
end
dimension 'Product', foreign_key: 'product_id' do
hierarchy all_member_name: 'All Products', primary_key: 'id', primary_key_table: 'd_products' do
join left_key: 'product_class_id', right_key: 'id' do
table 'd_products', schema: 'dwh'
table 'd_product_classes', schema: 'dwh'
end
level 'Product Family', table: 'd_product_classes', column: 'product_family'
level 'Product Department', table: 'd_product_classes', column: 'product_department'
level 'Product Category', table: 'd_product_classes', column: 'product_category'
level 'Product Subcategory', table: 'd_product_classes', column: 'product_subcategory'
level 'Brand Name', table: 'd_products', column: 'brand_name'
level 'Product Name', table: 'd_products', column: 'product_name'
end
end
dimension 'Time', foreign_key: 'time_id', type: 'TimeDimension' do
hierarchy all_member_name: 'All Time', primary_key: 'id' do
table 'd_time', schema: 'dwh'
level 'Year', column: 'year', type: 'Numeric', name_column: 'year_name', level_type: 'TimeYears'
level 'Quarter', column: 'quarter', type: 'Numeric', name_column: 'quarter_name', level_type: 'TimeQuarters'
level 'Month', column: 'month', type: 'Numeric', name_column: 'month_name', level_type: 'TimeMonths'
level 'Day', column: 'day', type: 'Numeric', name_column: 'day_name', level_type: 'TimeDays'
end
end
measure 'Sales Quantity', column: 'sales_quantity', aggregator: 'sum'
measure 'Sales Amount', column: 'sales_amount', aggregator: 'sum'
measure 'Sales Cost', column: 'sales_cost', aggregator: 'sum'
measure 'Customers Count', column: 'customer_id', aggregator: 'distinct-count'
end
end
mondrian-olap schema definition
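The schema above defines only the default Year → Quarter → Month → Day time hierarchy. The weekly hierarchy from the earlier Time Dimension slide could presumably be added as a second, named hierarchy in the same dimension; the sketch below is an assumption and would need extra week-related columns in d_time that the loader shown earlier does not populate:

dimension 'Time', foreign_key: 'time_id', type: 'TimeDimension' do
  hierarchy all_member_name: 'All Time', primary_key: 'id' do
    # ... default Year/Quarter/Month/Day levels as above ...
  end
  # hypothetical second hierarchy, which would be referenced in MDX as [Time.Weekly]
  hierarchy 'Weekly', all_member_name: 'All Time', primary_key: 'id' do
    table 'd_time', schema: 'dwh'
    level 'Year', column: 'year', type: 'Numeric', name_column: 'year_name', level_type: 'TimeYears'
    level 'Week', column: 'week_of_year', type: 'Numeric', level_type: 'TimeWeeks'  # assumed column
    level 'Day',  column: 'day', type: 'Numeric', name_column: 'day_name', level_type: 'TimeDays'
  end
end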
What were the total sales amounts in California in Q1 2014 by product families?
olap.from("Sales").
columns("[Measures].[Sales Amount]").
rows("[Product].[Product Family].Members").
where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
MDX Query Language
olap.from("Sales").
columns("[Measures].[Sales Amount]").
rows("[Product].[Product Family].Members").
where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
SELECT {[Measures].[Sales Amount]} ON COLUMNS,
[Product].[Product Family].Members ON ROWS
FROM [Sales]
WHERE ([Customer].[USA].[CA], [Time].[Quarter].[Q1 2014])
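The queries above assume an olap connection object, which the deck does not show being created. A sketch based on the mondrian-olap README (the gem runs on JRuby; the connection parameters here are assumptions for this demo, and the result-reading method names should be double-checked against the gem's documentation):

# schema is the Mondrian::OLAP::Schema.define(...) object from the earlier slide
olap = Mondrian::OLAP::Connection.create(
  driver:   "postgresql",
  host:     "localhost",
  database: "sales_app_dwh",   # assumed database name
  username: "dwh_user",        # assumed credentials
  password: "secret",
  schema:   schema
)

result = olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]").
  execute

# row_names gives the product family captions, values the measure cells
result.row_names.each_with_index do |product_family, i|
  puts "#{product_family}: #{result.values[i][0]}"
end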
Results Caching
SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
[Measures].[Customers Count]} ON COLUMNS,
[Product].[Product Family].Members ON ROWS
FROM [Sales] (21713.0ms)
SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
[Measures].[Customers Count]} ON COLUMNS,
[Product].[Product Family].Members ON ROWS
FROM [Sales] (10.0ms)
Additional Attribute Dimension
dimension 'Gender', foreign_key: 'customer_id' do
hierarchy all_member_name: 'All Genders', primary_key: 'id' do
table 'd_customers', schema: 'dwh'
level 'Gender', column: 'gender' do
name_expression do
sql "CASE d_customers.gender
WHEN 'F' THEN 'Female'
WHEN 'M' THEN 'Male'
END"
end
end
end
end
olap.from("Sales").
columns("[Measures].[Sales Amount]").
rows("[Gender].[Gender].Members")
Dynamic Attribute Dimension
dimension 'Age interval', foreign_key: 'customer_id' do
hierarchy all_member_name: 'All Age', primary_key: 'id' do
table 'd_customers', schema: 'dwh'
level 'Age interval' do
key_expression do
sql %[
CASE
WHEN age(d_customers.birth_date) < interval '20 years'
THEN '< 20 years'
WHEN age(d_customers.birth_date) < interval '30 years'
THEN '20-30 years'
WHEN age(d_customers.birth_date) < interval '40 years'
THEN '30-40 years'
WHEN age(d_customers.birth_date) < interval '50 years'
THEN '40-50 years'
ELSE '50+ years'
END
]
end
end
end
end
[Age interval].[< 20 years]
[Age interval].[20-30 years]
[Age interval].[30-40 years]
[Age interval].[40-50 years]
[Age interval].[50+ years]
Calculation Formulas
calculated_member 'Profit', dimension: 'Measures', format_string: '#,##0.00',
formula: '[Measures].[Sales Amount] - [Measures].[Sales Cost]'
calculated_member 'Margin %', dimension: 'Measures', format_string: '#,##0.00%',
formula: '[Measures].[Profit] / [Measures].[Sales Amount]'
olap.from("Sales").
columns("[Measures].[Profit]", "[Measures].[Margin %]").
rows("[Product].[Product Family].Members").
where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
Enables Ad-hoc Queries by Users
ETL process
Extract data from sources (database, REST API), Transform it, and Load it into the Data Warehouse (fact measures plus dimensions).
Ruby Tools for ETL
Kiba http://www.kiba-etl.org/
ETL https://github.com/square/ETL
Kiba example
# declare a ruby method here, for quick reusable logic
def parse_french_date(date)
Date.strptime(date, '%d/%m/%Y')
end
# or better, include a ruby file which loads reusable assets
# eg: commonly used sources / destinations / transforms, under unit-test
require_relative 'common'
# declare a source where to take data from (you implement it - see notes below)
source MyCsvSource, 'input.csv'
# declare a row transform to process a given field
transform do |row|
row[:birth_date] = parse_french_date(row[:birth_date])
# return to keep in the pipeline
row
end
# declare another row transform, dismissing rows conditionally by returning nil
transform do |row|
row[:birth_date].year < 2000 ? row : nil
end
# declare a row transform as a class, which can be tested properly
transform ComplianceCheckTransform, eula: 2015
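The example above leaves MyCsvSource (and a destination) for you to implement. A minimal sketch of what those classes might look like, following Kiba's contract that a source responds to each (yielding one row hash at a time) and a destination to write(row) and close; the class names and files here are hypothetical:

require 'csv'

class MyCsvSource
  def initialize(filename)
    @filename = filename
  end

  def each
    CSV.foreach(@filename, headers: true, header_converters: :symbol) do |row|
      yield row.to_hash
    end
  end
end

class MyCsvDestination
  def initialize(filename)
    @csv = CSV.open(filename, 'w')
    @headers_written = false
  end

  def write(row)
    unless @headers_written
      @csv << row.keys
      @headers_written = true
    end
    @csv << row.values
  end

  def close
    @csv.close
  end
end

The Kiba script would then end with a matching destination declaration, e.g. destination MyCsvDestination, 'output.csv'.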
Multithreaded ETL
https://github.com/ruby-concurrency/concurrent-ruby
Pipeline: Data source → Extract thread pool → Extracted data → Transform thread pool → Transformed data → Load thread pool
Pro-tip: Use
Single-threaded ETL
class Dwh::TimeDimension < Dwh::Dimension
def self.load!
logger.silence do
connection.select_values(%[
SELECT DISTINCT order_date FROM #{Order.table_name}
WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
]).each do |date|
insert_date(date)
end
end
end
def self.insert_date(date)
year, month, day = date.year, date.month, date.day
quarter = ((month-1)/3)+1
quarter_name = "Q#{quarter} #{year}"
month_name = date.strftime("%b %Y")
day_name = date.strftime("%b %d %Y")
sql = send :sanitize_sql_array, [
%[
INSERT INTO #{table_name}
(id, date_value, year, quarter, month, day,
year_name, quarter_name, month_name, day_name)
VALUES
(?, ?, ?, ?, ?, ?,
?, ?, ?, ?)
],
date_to_id(date), date, year, quarter, month, day,
year.to_s, quarter_name, month_name, day_name
]
connection.insert sql
end
end
require 'concurrent/executors'
class Dwh::TimeDimension < Dwh::Dimension
def self.parallel_load!(pool_size = 4)
logger.silence do
insert_date_pool = Concurrent::FixedThreadPool.new(pool_size)
connection.select_values(%[
SELECT DISTINCT order_date FROM #{Order.table_name}
WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
]).each do |date|
insert_date_pool.post(date) do |date|
connection_pool.with_connection do
insert_date(date)
end
end
end
insert_date_pool.shutdown
insert_date_pool.wait_for_termination
end
end
end
ETL with Thread Pool
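One practical caveat not covered in the deck: connection_pool.with_connection checks a database connection out per worker thread, so the ActiveRecord pool configured in database.yml must be at least as large as the ETL thread pool, or workers will block waiting for connections. A small guard one might add (a sketch, not from the talk):

class Dwh::TimeDimension < Dwh::Dimension
  # raise early instead of letting worker threads stall on connection checkout
  def self.check_pool_size!(pool_size)
    ar_pool = connection_pool.size
    if ar_pool < pool_size
      raise "ActiveRecord pool (#{ar_pool}) is smaller than the ETL thread pool (#{pool_size}); " \
            "increase the pool: setting in database.yml"
    end
  end
end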
Benchmark!
Dwh::TimeDimension.load! (5236.0ms)
Dwh::TimeDimension.parallel_load!(2) (3450.0ms)
Dwh::TimeDimension.parallel_load!(4) (2142.0ms)
Dwh::TimeDimension.parallel_load!(6) (2361.0ms)
Dwh::TimeDimension.parallel_load!(8) (2826.0ms)
optimal pool size in this case: 4 threads (2142.0ms)
Java Mission Control
Traditional vs Analytical Relational Databases
Traditional: optimized for transaction processing, row-based storage
Analytical: optimized for analytical queries, columnar storage
http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
Analytical Query Performance
SELECT d_product_classes.product_family,
SUM(f_sales.sales_amount) AS sales_amount,
SUM(f_sales.sales_cost) AS sales_cost,
COUNT(DISTINCT f_sales.customer_id) AS customers_count
FROM "dwh"."f_sales"
INNER JOIN "dwh"."d_products" ON "dwh"."d_products"."id" =
"dwh"."f_sales"."product_id"
INNER JOIN "dwh"."d_product_classes" ON "dwh"."d_product_classes"."id" =
"dwh"."d_products"."product_class_id"
GROUP BY d_product_classes.product_family
6 million rows: always ~18 seconds in the traditional row-based database vs. ~9 seconds on the first run and ~1.5 seconds on subsequent runs in the analytical columnar database
When to use what?

Fact table size   Traditional transactional databases   Analytical columnar databases
< 1M rows         OK                                    No big win
1-10M rows        Complex queries slower                OK
10-100M rows      Slow                                  OK
> 100M rows       Very slow                             OK with tuning
What did we cover?
Problems with analytical queries
Dimensional modeling
Star schemas
Mondrian OLAP and MDX
ETL – Extract, Transform, Load
Analytical columnar databases
Questions?
raimonds.simanovskis@gmail.com
@rsim github.com/rsim
https://github.com/rsim/sales_app_demo
