Cloudera, Data Warehouse Optimisation 
Jérôme Campo, Systems Engineering 
MAY 2014
The Enterprise Data Warehouse 
[Diagram: data sources (ERP, CRM, RDBMS, machines; files, images, videos, logs, clickstreams; external data sources) shown with servers, marts, DW, documents, storage, search, and archive systems]
1. Visibility
•Leaving data behind
•Risk and compliance
•High cost of storage
2. Time to Data
•Up-front modeling
•Transforms slow
•Transforms lose data
3. Cost of Analytics
•Existing systems strained
•No agility
•BI backlog
4. Complex Architecture
•Many special-purpose systems, silos of data
•Moving data around
•No complete views
Cloudera for the Enterprise Data Hub 
1. Active archive
•Full fidelity original data
•Indefinite time, any source
•Lowest cost storage
2. Data management, transforms
•One source of data for all analytics
•Persist state of transformed data
•Significantly faster & cheaper
3. Self-service exploratory BI
•Simple search + BI tools
•“Schema on read” agility
•Reduce BI user backlog requests
4. Multi-workload analytic platform
•Bring applications to data
•Combine different workloads on common data (e.g. SQL + Search)
•True BI agility
[Diagram: data sources (ERP, CRM, RDBMS, machines; files, images, videos, logs, clickstreams; external data sources) shown with servers, marts, DW, documents, storage, search, and archive systems]
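The “schema on read” point on this slide can be illustrated with a minimal Python sketch. It is not Cloudera code and does not use any Hadoop API: the event records, field names, and the `read_with_schema` helper are all hypothetical, chosen only to show raw data being kept at full fidelity while each analysis applies its own schema at read time.

```python
import json

# Hypothetical raw clickstream events, stored exactly as they arrived
# (full fidelity, no up-front modeling).
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "referrer": "ad"}',
    '{"user": "u1", "page": "/cart"}',
]


def read_with_schema(raw_lines, schema):
    """Project each raw record onto `schema` at read time.

    `schema` maps a field name to a (cast, default) pair; fields missing
    from a record fall back to the default instead of failing the load.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Two different questions, two different schemas, one copy of the data.
latency_schema = {"page": (str, ""), "ms": (int, 0)}
referrer_schema = {"user": (str, ""), "referrer": (str, "direct")}

latency_rows = list(read_with_schema(RAW_EVENTS, latency_schema))
referrer_rows = list(read_with_schema(RAW_EVENTS, referrer_schema))
```

Because the raw lines are never rewritten, a new question only needs a new schema dictionary, not a reload of the data.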
Cloudera for the Enterprise Data Hub
Cloudera for Data Warehouse optimisation
EDW optimisation: Active Archive 
What to Migrate
•Archive datasets
•Infrequently accessed tables
•Large corpus of data
Influencing Factors
•Frequency of data access
•Changing regulatory compliance requirements
•Data volume growth
Better in Cloudera
•Data remains accessible
•Data is not lost
•1/10th the cost
Only Available with Cloudera
•Reliability for mission-critical workloads: high availability, disaster recovery, zero-downtime upgrades
•Low-latency SQL processing, ability to absorb short-cycle ELT
•Broad support of leading data integration tools
Key Partners
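As a rough illustration of the “What to Migrate” criteria above (infrequently accessed tables, large corpus of data), the selection could be sketched as follows. This is only a sketch under stated assumptions: the `TableStats` fields, the thresholds, and the sample tables are hypothetical and do not come from the deck or from any Cloudera tool.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical per-table usage statistics gathered from the warehouse."""
    name: str
    size_gb: float
    queries_last_90_days: int


def archive_candidates(tables, max_queries=5, min_size_gb=100.0):
    """Return large, rarely queried tables, biggest first.

    Both thresholds are illustrative defaults, not recommendations.
    """
    chosen = [
        t for t in tables
        if t.queries_last_90_days <= max_queries and t.size_gb >= min_size_gb
    ]
    return sorted(chosen, key=lambda t: t.size_gb, reverse=True)


# Made-up sample inventory for demonstration only.
sample_tables = [
    TableStats("orders_2009", 800.0, 2),
    TableStats("orders_current", 500.0, 4000),
    TableStats("audit_log_2010", 1500.0, 0),
    TableStats("lookup_codes", 0.5, 1),
]

candidate_names = [t.name for t in archive_candidates(sample_tables)]
```

In practice the influencing factors listed on the slide (compliance requirements, data volume growth) would also feed into the decision; the sketch covers only access frequency and size.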
EDW optimisation: Transformation 
What to Migrate
•High-scale batch data processing
•Implemented as SQL + scripting or ETL running on expensive HW infrastructure
•Staging data stored across diverse temp tables
Influencing Factors
•High fraction of overall EDW utilization (25–80%)
•Difficult to store, manage staging data in relational form
•Limited user adoption risk to migrate
•ETL tools to simplify migration
Better in Cloudera
•Over 2X the performance
•1/10th the cost
•Persistent staging, tracked lineage
Only Available with Cloudera
•Reliability for mission-critical workloads: high availability, disaster recovery, zero-downtime upgrades
•Low-latency SQL processing, ability to absorb short-cycle ELT
•Broad support of leading data integration tools
Key Partners
EDW optimisation: Self Service BI 
Workload
•Self-Service BI, Exploratory BI, Data Discovery
•Uncertain business questions and uncertain data
Migration Priority
•Fastest growing workload for many warehouses
•Comparable support for end user tools between Cloudera and DBMS products
Better in Cloudera
•Schema flexibility
•End user self-service on full fidelity data
•1/10th the cost
Only Available with Cloudera
•Open source parallel interactive SQL engine: Cloudera Impala
•Integration and certification of every leading SSBI vendor
Key Partners
EDW optimisation: Multi-workload 
Workload and Data
•Training & scoring predictive models
•Deep and broad data sets, within and beyond the warehouse
Influencing Factors
•Statisticians want unconstrained analysis; limited DW compute resources
•Paying top dollar for warehouse data storage only to load into ML tools
•Inability to analyze data beyond the warehouse
Better in Cloudera
•Greater user productivity (pre-packaged ML libraries, no more down-sampling)
•Support for 3rd-party ML tools
•Greater flexibility (SQL + MR + Search + Spark + SAS procs)
•1/10th the cost
Only Available with Cloudera
•Ability to run SAS, R natively on the same cluster
•Interactive search and SQL experience for data exploration
•Built-in analytics libraries (Mahout, DataFu, ClouderaML)
•Support from Cloudera’s Data Science team
Key Partners
Why EDW optimisation? 
1. Lower costs of data management, allow growth
2. Improve quality of service
•Shorten ETL windows
•Faster BI queries
3. Extend existing warehouse capacity
•Increase ROI from current investments
•More operational data: volume and schemas
•More business intelligence and analytics workloads
4. Retain all data for more varied analysis
5. Deliver a foundation for innovation
•Bring more applications to Hadoop data for low incremental cost
Customers agree, Cloudera delivers 
Customer | Workload | Results
Leading Payments Company | Analytics, ETL Processing, DR | Largest fraud discovery in firm history; time to report collapsed from 2 days to 2 hours; saved $30M on DR
Global Money Center Bank | Data Processing (ELT) | Avoided tens of millions in expansion purchases; 42% faster processing
Mobile Device Manufacturer | Data Processing (ELT) | Offloaded 90% of data volume; keep all data
Fortune 500 Retailer | Analytics | More insights by supporting more exploration of more extensive & granular data
Leading Financial Regulator | Data Processing (ELT) and DR | Shrank EDW footprint by 4PB; 20X performance boost
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
