Multidimensional
 Data Analysis
  with JRuby
   Raimonds Simanovskis
      github.com/rsim
           @rsim
Relational
data model
SQL is good for detailed
       data queries
           Get all sales transactions in
           USA, California
SELECT customers.fullname, products.product_name,
  sales.sales_date, sales.unit_sales, sales.store_sales
FROM sales
  LEFT JOIN products ON sales.product_id = products.id
  LEFT JOIN customers ON sales.customer_id = customers.id
WHERE customers.country = 'USA' AND customers.state_province = 'CA'
SQL becomes complex
       for analytical queries
           Get total sales in USA, California
           in Q1, 2011 by main product groups

SELECT product_class.product_family,
       SUM(sales.unit_sales) unit_sales_sum,
       SUM(sales.store_sales) store_sales_sum
    FROM sales
      LEFT JOIN product ON sales.product_id = product.product_id
      LEFT JOIN product_class
           ON product.product_class_id = product_class.product_class_id
      LEFT JOIN time_by_day ON sales.time_id = time_by_day.time_id
      LEFT JOIN customer ON sales.customer_id = customer.customer_id
    WHERE time_by_day.the_year = 2011 AND time_by_day.quarter = 'Q1'
      AND customer.country = 'USA' AND customer.state_province = 'CA'
    GROUP BY product_class.product_family
Maybe write a distributed
MapReduce function?
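To make the map/reduce alternative concrete, here is a minimal single-process sketch in plain Ruby. The rows, the field names and the helpers `map_sales` and `reduce_sales` are all hypothetical and not part of any library; a real distributed job would run the same two steps across many nodes.

```ruby
# Hypothetical in-memory fact rows (made-up data for illustration only).
SALES = [
  { family: 'Food',  state: 'CA', year: 2011, quarter: 'Q1', unit_sales: 3, store_sales: 7.5 },
  { family: 'Drink', state: 'CA', year: 2011, quarter: 'Q1', unit_sales: 2, store_sales: 4.0 },
  { family: 'Food',  state: 'CA', year: 2011, quarter: 'Q1', unit_sales: 1, store_sales: 2.5 },
  { family: 'Food',  state: 'WA', year: 2011, quarter: 'Q1', unit_sales: 5, store_sales: 9.0 },
  { family: 'Food',  state: 'CA', year: 2011, quarter: 'Q2', unit_sales: 4, store_sales: 8.0 }
].freeze

# Map step: keep only California rows from Q1 2011 and emit
# [product family, [unit sales, store sales]] pairs.
def map_sales(rows)
  rows.select { |r| r[:state] == 'CA' && r[:year] == 2011 && r[:quarter] == 'Q1' }
      .map { |r| [r[:family], [r[:unit_sales], r[:store_sales]]] }
end

# Reduce step: sum the emitted values for each product family.
def reduce_sales(pairs)
  pairs.each_with_object(Hash.new { |h, k| h[k] = [0, 0.0] }) do |(family, (units, amount)), totals|
    totals[family][0] += units
    totals[family][1] += amount
  end
end

TOTALS = reduce_sales(map_sales(SALES))
```

Even in this toy form, the filter conditions and the grouping key are hard-coded into the functions, so every new analytical question needs new code.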
Multidimensional
      Data Model
Multidimensional cubes

     Dimensions
Hierarchies and levels

      Measures
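The terms above can be illustrated with a small plain-Ruby sketch. It assumes a made-up fact list in which each dimension value is stored as a path through its hierarchy (Year > Quarter, Country > State), and a hypothetical `roll_up` helper that sums a measure at a chosen level. This is only a conceptual model, not how an OLAP engine stores or aggregates data.

```ruby
# Each fact holds coordinates in two dimensions plus one measure.
# time is a [year, quarter] path, customer is a [country, state] path.
# All values are invented for illustration.
FACTS = [
  { time: [2011, 'Q1'], customer: ['USA', 'CA'],    unit_sales: 3 },
  { time: [2011, 'Q1'], customer: ['USA', 'WA'],    unit_sales: 5 },
  { time: [2011, 'Q2'], customer: ['USA', 'CA'],    unit_sales: 4 },
  { time: [2011, 'Q1'], customer: ['Canada', 'BC'], unit_sales: 2 }
].freeze

# Sum a measure at the requested hierarchy depth of each dimension.
# levels maps a dimension name to a depth, e.g. { time: 1 } means the
# Year level and { time: 2 } means the Quarter level.
def roll_up(facts, measure, levels)
  facts.each_with_object(Hash.new(0)) do |fact, cells|
    key = levels.map { |dim, depth| fact[dim].first(depth) }
    cells[key] += fact[measure]
  end
end

BY_YEAR_AND_COUNTRY = roll_up(FACTS, :unit_sales, { time: 1, customer: 1 })
BY_QUARTER = roll_up(FACTS, :unit_sales, { time: 2 })
```

Changing the `levels` argument moves the same measure up or down a hierarchy without rewriting the aggregation logic, which is the core idea behind a cube.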
OLAP technologies
  On-Line Analytical Processing
http://github.com/rsim/mondrian-olap
MDX query language
          Get total units sold and sales amount
          in USA, California in Q1, 2011
          by main product groups


SELECT {[Measures].[Unit Sales], [Measures].[Store Sales]} ON COLUMNS,
       [Product].children ON ROWS
FROM   [Sales]
WHERE ( [Time].[2011].[Q1], [Customers].[USA].[CA] )
Or in Ruby like this
       Get total units sold and sales amount
       in USA, California in Q1, 2011
       by main product groups

olap.from('Sales').
columns('[Measures].[Unit Sales]',
        '[Measures].[Store Sales]').
rows('[Product].children').
where('[Time].[2011].[Q1]', '[Customers].[USA].[CA]').
execute
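To show how chained calls like the ones above can correspond to MDX clauses, here is a toy builder in plain Ruby. `ToyQuery` is a hypothetical class written for this illustration; it is not the mondrian-olap implementation and it only assembles a string instead of executing anything.

```ruby
# Toy query builder (NOT the mondrian-olap gem): each chained call
# records its arguments and returns self so that calls can be chained.
class ToyQuery
  def initialize(cube)
    @cube = cube
    @columns = []
    @rows = []
    @where = []
  end

  def columns(*members)
    @columns = members
    self
  end

  def rows(*members)
    @rows = members
    self
  end

  def where(*members)
    @where = members
    self
  end

  # Assemble the recorded parts into MDX text.
  def to_mdx
    mdx = "SELECT {#{@columns.join(', ')}} ON COLUMNS, " \
          "{#{@rows.join(', ')}} ON ROWS FROM [#{@cube}]"
    mdx += " WHERE (#{@where.join(', ')})" unless @where.empty?
    mdx
  end
end

TOY_MDX = ToyQuery.new('Sales')
                  .columns('[Measures].[Unit Sales]', '[Measures].[Store Sales]')
                  .rows('[Product].children')
                  .where('[Time].[2011].[Q1]', '[Customers].[USA].[CA]')
                  .to_mdx
```

Returning `self` from every builder method is what allows the fluent chaining style; the real gem additionally runs the query against a cube and returns a result object.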
Also more complex
                queries
           Get sales amount and profit %
           of top 50 products sold in USA and Canada
           during Q1, 2011

olap.from('Sales').
with_member('[Measures].[ProfitPct]').
  as('(Measures.[Store Sales] - Measures.[Store Cost]) / Measures.[Store Sales]',
  :format_string => 'Percent').
columns('[Measures].[Store Sales]', '[Measures].[ProfitPct]').
rows('[Product].children').crossjoin('[Customers].[Canada]', '[Customers].[USA]').
  top_count(50, '[Measures].[Store Sales]').
where('[Time].[2011].[Q1]').
execute
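What the calculated member and the top-count step compute can be sketched in plain Ruby over a made-up product list. The data and the helpers `profit_pct` and `top_count` below are hypothetical stand-ins for illustration; in the query above the OLAP engine does this work.

```ruby
# Invented per-product totals for illustration only.
PRODUCTS = [
  { name: 'A', store_sales: 200.0, store_cost: 150.0 },
  { name: 'B', store_sales: 500.0, store_cost: 250.0 },
  { name: 'C', store_sales: 100.0, store_cost: 90.0 }
].freeze

# Same formula as the ProfitPct member:
# (Store Sales - Store Cost) / Store Sales
def profit_pct(row)
  (row[:store_sales] - row[:store_cost]) / row[:store_sales]
end

# Top-count style selection: order rows by a measure, highest first,
# and keep the first n of them.
def top_count(rows, n, measure)
  rows.sort_by { |r| -r[measure] }.first(n)
end

TOP_TWO = top_count(PRODUCTS, 2, :store_sales).map { |r| [r[:name], profit_pct(r)] }
```

Here the ranking uses the sales amount while the reported value is the derived percentage, mirroring how the query ranks by `[Measures].[Store Sales]` but also displays `[Measures].[ProfitPct]`.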
OLAP schema
            (mapping cube to tables)
schema = Mondrian::OLAP::Schema.define do
  cube 'Sales' do
    table 'sales'
    dimension 'Gender', :foreign_key => 'customer_id' do
      hierarchy :has_all => true, :primary_key => 'customer_id' do
        table 'customer'
        level 'Gender', :column => 'gender', :unique_members => true
      end
    end
    dimension 'Time', :foreign_key => 'time_id' do
      hierarchy :has_all => false, :primary_key => 'time_id' do
        table 'time_by_day'
        level 'Year', :column => 'the_year', :type => 'Numeric', :unique_members => true
        level 'Quarter', :column => 'quarter', :unique_members => false
        level 'Month', :column => 'month_of_year', :type => 'Numeric', :unique_members => false
      end
    end
    measure 'Unit Sales', :column => 'unit_sales', :aggregator => 'sum'
    measure 'Store Sales', :column => 'store_sales', :aggregator => 'sum'
  end
end
mondrian-olap gem
   eazybi.com
