Designing Fast Data Architecture
for Big Data using Logical Data
Warehouse and Data Lakes
A Case Study presented by Kurt Jackson
Platform Lead, Autodesk
Speakers
Kurt Jackson, Platform Lead
Ravi Shankar, Chief Marketing Officer
Agenda
1. Towards a Logical Data Lake – An Autodesk Case Study
2. Performance Considerations in Logical Data Warehouses/Lakes
3. Q&A and Next Steps
© 2016 Autodesk | Enterprise Information Services
Towards the Logical Data Lake
Kurt Jackson
Platform Lead
What is a Logical Data Warehouse?
 A logical data warehouse is a data system that follows the ideas of a
traditional EDW (star or snowflake schemas) and, in addition to one or
more core DWs, includes data from external sources.
 The main motivations are improved decision making and/or cost
reduction.
What about the Logical Data Lake?
 A Data Lake will not have a star or snowflake schema, but rather a
more heterogeneous collection of views exposing raw data from
heterogeneous sources.
 The virtual layer acts as a common umbrella under which these
different sources are presented to the end user as a single system.
 However, from the virtualization perspective, a Virtual Data Lake
shares many technical aspects with an LDW, and most of this content
also applies to a Logical Data Lake.
Introduction
Agile Data Architecture Lifecycle
Stages: Agile Data Architecture → Logical Data Warehouse → Enterprise Access Point → Data Virtualization → Logical Data Lake
Qualities: rapid, iterative delivery; modeling of business concepts; curated, governed, fit-for-purpose data; stable, system-independent, durable views; a single source of data across the enterprise; enterprise governance and access control
Business Problem
Most of us are in the same boat
The Autodesk Agile Data Architecture
Philosophy
 Access and refine data
near the source
 Published logical data
interfaces
 Implementing interfaces is
only an IT concern
 Agile and opportunistic
retirement of legacy
systems
Autodesk Data Architecture
The journey to the Logical Data Lake
Data virtualization can be used throughout your data pipeline!
Big Data Ecosystem
Towards the Logical Data Lake
Logical Data Warehouse (LDW)
 Usability
 Single repository for schema
definitions
 Integrity
 Only published views are publicly available
 Business ownership
guarantees the quality
The Enterprise Access Point
 Availability
 A single access point to
administer
 Security
 A single point for
authentication,
authorization and audit trail
Data Virtualization
 System-agnostic
 Data contract independent of
implementation
 Enables the agile enterprise
 Aggressively and
opportunistically refactor
enterprise systems
 Manifest support for federation
and data blending
Logical Data Lake
 Beyond the traditional
Enterprise Data Lake
 Combine traditional data lakes
with other data
 Blending of disparate,
heterogeneous data sources
 Internal, external, 3rd party…
 Approaches a true, single
origin for all enterprise data
The Logical Data Lake implements the philosophy
 Access and refine data near the
source
 No painful ETL pipelines for data
derivation
 Published logical data interfaces
 A single access point for all external
data sets
 Implementing interfaces is only an IT
concern
 Accelerate to best of breed solutions
 Agile and opportunistic retirement of
legacy systems
 Rip and replace becomes almost
transparent
Building the Agile Data Architecture at Autodesk
Implementation Approach
 Identify enterprise data sources
 Harder than you think
 An all-new streaming, highly available
ingestion mechanism
 Self-service or nearly so
 A stream-based approach supports both
batch and streaming data processing
 Leverage highly-redundant cloud
storage for the data lake
 S3
 Leverage best-of-breed solutions for individual
components
 Open source, selected commercial
vendors
 Develop canonical representations for your
data sets
 Very difficult – consider a CDO
 Virtualize the data warehouses and marts
with a next-generation Logical DW
 New implementations leverage the
LDW
 Legacy migrates opportunistically
Architecting the Data Virtualization Layer
[Architecture diagram: data consumers access DV Instance 1 … DV Instance n, which connect to the data sources and to Corporate LDAP; a source code repository delivers code to the instances via CI/CD, and audit data flows to the logging infrastructure]
Build an Information Architecture
 Base views to abstract data sources
 Layered derived views to reflect successively refined
derivations
 Create the notion of publication for curated, externally
visible views
 Expose services on top of views to make views more
accessible
 Separate namespaces (schemas) by project or
subject area
 Build the notion of commonality for views shared
across schemas
 Naming conventions for all objects
 Data portal for one-stop shopping for data consumers
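The layering above can be illustrated in miniature. The following is a hypothetical Python sketch, not Autodesk's implementation: the `bv_`/`dv_`/`pv_` names and the sample data are invented. Base views abstract the raw source, derived views refine them step by step, and only published views are exposed to consumers.

```python
# Base view: abstracts the raw source (a list of dicts stands in for a table).
RAW_ORDERS = [
    {"id": 1, "cust": "acme", "usd": 100.0, "status": "SHIPPED"},
    {"id": 2, "cust": "acme", "usd": 40.0,  "status": "CANCELLED"},
    {"id": 3, "cust": "beta", "usd": 75.0,  "status": "SHIPPED"},
]

def bv_orders():                       # base view over the source
    return list(RAW_ORDERS)

def dv_shipped_orders():               # derived view: successive refinement
    return [o for o in bv_orders() if o["status"] == "SHIPPED"]

def pv_revenue_by_customer():          # published view: the curated contract
    totals = {}
    for o in dv_shipped_orders():
        totals[o["cust"]] = totals.get(o["cust"], 0.0) + o["usd"]
    return totals

# Only published views are externally visible; base and derived views stay internal.
PUBLISHED = {"revenue_by_customer": pv_revenue_by_customer}

print(PUBLISHED["revenue_by_customer"]())  # {'acme': 100.0, 'beta': 75.0}
```

Because each layer is defined only against the layer below, the raw source can be swapped out without touching the published contract.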
The journey towards the Logical Data Lake can be a liberating one!
Performance Considerations in Logical Data Warehouses/Lakes
Ravi Shankar, CMO
Denodo
Query Optimization in Data Virtualization
Real-time Data Integration
“Data virtualization integrates disparate data sources in real time or near-real time
to meet demands for analytics and transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
1. Connects to disparate data sources
2. Combines related data into views
3. Publishes the data to applications
Dynamic Query Rewriting for Big Data
1. Full Aggregation Pushdown
Total sales for each product in the last 2 years
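The idea can be sketched in a few lines. This is a minimal illustration, not Denodo's engine: SQLite stands in for a remote source, and the table and rows are invented. It contrasts shipping every detail row to the virtual layer with pushing the whole GROUP BY down to the source.

```python
import sqlite3

# SQLite stands in for a remote source holding the sales fact table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (product TEXT, sale_date TEXT, amount REAL)")
src.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("hammer", "2015-03-01", 10.0),
    ("hammer", "2016-01-15", 12.5),
    ("wrench", "2015-07-20", 7.0),
    ("wrench", "2014-02-02", 99.0),   # outside the 2-year window
])

# Naive federation: pull every detail row, then aggregate in the virtual layer.
naive = {}
for product, amount in src.execute(
        "SELECT product, amount FROM sales WHERE sale_date >= '2015-01-01'"):
    naive[product] = naive.get(product, 0.0) + amount

# Full aggregation pushdown: the GROUP BY executes at the source, so only
# one row per product is transferred instead of every individual sale.
pushed = dict(src.execute(
    "SELECT product, SUM(amount) FROM sales "
    "WHERE sale_date >= '2015-01-01' GROUP BY product"
))

assert naive == pushed  # same answer, far fewer rows over the network
print(pushed)
```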
2. Partition Pruning
Total sales for each product in the last 9 months
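The same idea applies to pruning. In this toy sketch (the monthly partition layout and all figures are invented), partitions falling outside the queried window are never read at all.

```python
from datetime import date

# Toy monthly partitions of a sales fact table: {(year, month): rows}.
partitions = {
    (2016, 1): [("hammer", 5.0)],
    (2016, 4): [("hammer", 3.0), ("wrench", 2.0)],
    (2015, 2): [("wrench", 50.0)],   # older than 9 months: should be pruned
}

def months_back(today, n):
    """Yield the (year, month) keys covering the last n months, inclusive."""
    y, m = today.year, today.month
    for _ in range(n):
        yield (y, m)
        m -= 1
        if m == 0:
            y, m = y - 1, 12

def total_sales(today, window_months):
    wanted = set(months_back(today, window_months))
    totals = {}
    # Partition pruning: partitions outside the window are never opened.
    for key in partitions.keys() & wanted:
        for product, amount in partitions[key]:
            totals[product] = totals.get(product, 0.0) + amount
    return totals

totals = total_sales(date(2016, 6, 1), 9)
print(totals)
```

Only two of the three partitions are touched; a real engine makes the same decision from partition metadata before any I/O happens.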
3. Partial Aggregation Pushdown
Total sales for each product category in the last 9 months
1. Products and their Categories – Products DB
2. Total Sales by Product – Data Warehouse
3. Join both tables and aggregate – Data Virtualization
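A minimal sketch of the three steps above, with SQLite standing in for the Data Warehouse and a plain dict for the Products DB (all names and figures are invented): since the category column lives in a different source, the GROUP BY cannot be pushed down whole, but the per-product aggregation can, so only a small pre-aggregated result is joined and re-aggregated in the virtual layer.

```python
import sqlite3

# SQLite stands in for the warehouse holding the big sales fact table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (product TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("hammer", 10.0), ("hammer", 5.0), ("wrench", 7.0), ("drill", 20.0),
])

# A dict stands in for the Products DB holding the category dimension.
categories = {"hammer": "hand tools", "wrench": "hand tools", "drill": "power tools"}

# Step 2 (pushed down): aggregate by product inside the warehouse; one row
# per product comes back instead of every sale.
by_product = dw.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product").fetchall()

# Step 3 (virtual layer): join the small pre-aggregated result with the
# category dimension and finish the aggregation by category.
by_category = {}
for product, subtotal in by_product:
    cat = categories[product]
    by_category[cat] = by_category.get(cat, 0.0) + subtotal

print(by_category)
```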
4. On the Fly Data Movement
Maximum discount for each product in the last 9 months
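A rough sketch of the idea (invented tables and figures; a real optimizer chooses this plan automatically from statistics): the small price table is copied on the fly into a temporary table at the big source, so the join and the MAX() aggregation execute there and only the final rows travel back.

```python
import sqlite3

# SQLite stands in for the big-data source holding the sales fact table.
big = sqlite3.connect(":memory:")
big.execute("CREATE TABLE sales (product TEXT, discount REAL)")
big.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("hammer", 0.25), ("hammer", 0.125), ("wrench", 0.5),
])

# Small price table that lives in a *different* source (a plain list here).
list_price = [("hammer", 20.0), ("wrench", 15.0)]

# On-the-fly data movement: ship the small table to the big source as a
# temporary table, so the join runs next to the large fact table.
big.execute("CREATE TEMP TABLE prices (product TEXT, price REAL)")
big.executemany("INSERT INTO prices VALUES (?, ?)", list_price)

result = dict(big.execute(
    "SELECT s.product, MAX(s.discount * p.price) "
    "FROM sales s JOIN prices p ON s.product = p.product "
    "GROUP BY s.product"
).fetchall())
print(result)
```

Moving a 400 K-row dimension is vastly cheaper than moving 290 M fact rows the other way, which is why this plan wins for queries like the one above.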
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
Compares the performance of a federated approach in Denodo with an MPP system in which all the data has been replicated via ETL. Both sides hold the same data: Customer Dim. (2 M rows), Sales Facts (290 M rows), and Items Dim. (400 K rows).
* TPC-DS is the de facto industry-standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
Query Description | Returned Rows | Time Netezza | Time Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement
Dynamic Query Optimizer
Delivers breakthrough performance for Big Data, Logical Data Warehouse, and operational scenarios:
 Dynamically determines the lowest-cost query execution plan based on statistics
 Factors in the special characteristics of big data sources, such as the number of processing units and partitions
 Can easily handle any number of incremental queries
 Enables connectivity to the broadest array of big data sources, such as Redshift, Impala, and Spark
 Best dynamic query optimization engine in the industry
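The statistics-driven choice can be sketched as a toy cost model (invented numbers and plan names, not the actual optimizer): estimate the rows transferred to the virtual layer under each candidate plan and pick the cheapest.

```python
# Hypothetical source statistics driving the plan choice.
stats = {"sales_rows": 290_000_000, "distinct_products": 31_350}

# Estimated rows transferred to the virtual layer under each candidate plan.
plans = {
    "fetch_all_rows_then_aggregate": stats["sales_rows"],
    "push_group_by_to_source": stats["distinct_products"],
}

best = min(plans, key=plans.get)  # lowest estimated cost wins
print(best, plans[best])  # push_group_by_to_source 31350
```

A production optimizer weighs many more factors (partitioning, processing units, network cost), but the principle is the same: the plan is chosen per query, from statistics, at run time.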
Q&A
Kurt Jackson, Platform Lead
Ravi Shankar, Chief Marketing Officer
Next Steps
 View the educational seminar (on-demand): Logical Data Warehouse, Data Lakes, and Data Services Marketplace. Visit: www.denodo.com
 Read about Data Virtualization for Logical Data Warehouses and Data Lakes. Visit: denodo.com/en/solutions/horizontal-solutions/logical-data-warehouse
 Get started! Download Denodo Express. Visit: www.denodo.com/en/denodo-platform/denodo-express
 Access the Denodo Platform on AWS. Visit: www.denodo.com/en/denodo-platform/denodo-platform-for-aws
Thank You
Kurt Jackson, Platform Lead
Ravi Shankar, Chief Marketing Officer
