BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation

Data warehouse optimisation, by Jérôme Campo, Cloudera

  1. Cloudera, Data Warehouse Optimisation. Jérôme Campo, Systems Engineering. May 2014
  2. The Enterprise Data Warehouse
     Diagram labels: SERVERS; MARTS; DW; DOCUMENTS; STORAGE; SEARCH; ARCHIVE; ERP, CRM, RDBMS, MACHINES; FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS; EXTERNAL DATA SOURCES
     1 Visibility
       • Leaving data behind
       • Risk and compliance
       • High cost of storage
     2 Time to Data
       • Up-front modeling
       • Transforms slow
       • Transforms lose data
     3 Cost of Analytics
       • Existing systems strained
       • No agility
       • BI backlog
     4 Complex Architecture
       • Many special-purpose systems, silos of data
       • Moving data around
       • No complete views
  3. Cloudera for the Enterprise Data Hub
     1 Active archive
       • Full fidelity original data
       • Indefinite time, any source
       • Lowest cost storage
     2 Data management, transforms
       • One source of data for all analytics
       • Persist state of transformed data
       • Significantly faster & cheaper
     3 Self-service exploratory BI
       • Simple search + BI tools
       • “Schema on read” agility
       • Reduce BI user backlog requests
     4 Multi-workload analytic platform
       • Bring applications to data
       • Combine different workloads on common data (e.g. SQL + Search)
       • True BI agility
     Diagram labels: SERVERS; MARTS; DW; DOCUMENTS; STORAGE; SEARCH; ARCHIVE; ERP, CRM, RDBMS, MACHINES; FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS; EXTERNAL DATA SOURCES
  4. Cloudera for the Enterprise Data Hub
  5. Cloudera for Data Warehouse optimisation
  6. EDW optimisation: Active Archive
     What to Migrate
       • Archive datasets
       • Infrequently accessed tables
       • Large corpus of data
     Influencing Factors
       • Frequency of data access
       • Changing regulatory compliance requirements
       • Data volume growth
     Better in Cloudera
       • Data remains accessible
       • Data is not lost
       • 1/10th the cost
     Only Available with Cloudera
       • Reliability for mission-critical workloads: high availability, disaster recovery, downtime-less upgrades
       • Low-latency SQL processing, ability to absorb short-cycle ELT
       • Broad support of leading data integration tools
     Key Partners
  7. EDW optimisation: Transformation
     What to Migrate
       • High-scale batch data processing
       • Implemented as SQL + scripting or ETL running on expensive HW infrastructure
       • Staging data stored across diverse, temp tables
     Influencing Factors
       • High fraction of overall EDW utilization (25–80%)
       • Difficult to store, manage staging data in relational form
       • Limited user adoption risk to migrate
       • ETL tools to simplify migration
     Better in Cloudera
       • Over 2X the performance
       • 1/10th the cost
       • Persistent staging, tracked lineage
     Only Available with Cloudera
       • Reliability for mission-critical workloads: high availability, disaster recovery, downtime-less upgrades
       • Low-latency SQL processing, ability to absorb short-cycle ELT
       • Broad support of leading data integration tools
     Key Partners
  8. EDW optimisation: Self Service BI
     Workload
       • Self-Service BI, Exploratory BI, Data Discovery
       • Uncertain business questions and uncertain data
     Migration Priority
       • Fastest growing workload for many warehouses
       • Comparable support for end user tools between Cloudera and DBMS products
     Better in Cloudera
       • Schema flexibility
       • End user self-service on full fidelity data
       • 1/10th the cost
     Only Available with Cloudera
       • Open source parallel interactive SQL engine: Cloudera Impala
       • Integration and certification of every leading SSBI vendor
     Key Partners
  9. EDW optimisation: Multi-workload
     Workload and Data
       • Training & scoring predictive models
       • Deep and broad data sets, within and beyond the warehouse
     Influencing Factors
       • Statisticians want unconstrained analysis; limited DW compute resources
       • Paying top dollar for warehouse data storage only to load into ML tools
       • Inability to analyze data beyond the warehouse
     Better in Cloudera
       • Greater user productivity (pre-packaged ML libraries, no more down-sampling)
       • Support for 3rd party ML tools
       • Greater flexibility (SQL + MR + Search + Spark + SAS procs)
       • 1/10th the cost
     Only Available with Cloudera
       • Ability to run SAS, R natively on the same cluster
       • Interactive search and SQL experience for data exploration
       • Built-in analytics libraries (Mahout, DataFu, ClouderaML)
       • Support from Cloudera’s Data Science team
     Key Partners
  10. Why EDW optimisation?
     1. Lower costs of data management, allow growth
     2. Improve quality of service
       • Shorten ETL windows
       • Faster BI queries
     3. Extend existing warehouse capacity
       • Increase ROI from current investments
       • More operational data: volume and schemas
       • More business intelligence and analytics workloads
     4. Retain all data for more varied analysis
     5. Deliver a foundation for innovation
       • Bring more applications to Hadoop data for low incremental cost
  11. Customers agree, Cloudera delivers
     • Customer: Leading Payments Company
       Workload: Analytics, ETL Processing, DR
       Results: Largest fraud discovery in firm history; time to report collapsed from 2 days to 2 hours; saved $30M on DR
     • Customer: Global Money Center Bank
       Workload: Data Processing (ELT)
       Results: Avoided tens of millions in expansion purchases; 42% faster processing
     • Customer: Mobile Device Manufacturer
       Workload: Data Processing (ELT)
       Results: Offloaded 90% of data volume; keep all data
     • Customer: Fortune 500 Retailer
       Workload: Analytics
       Results: More insights by supporting more exploration of more extensive & granular data
     • Customer: Leading Financial Regulator
       Workload: Data Processing (ELT) and DR
       Results: Shrank EDW footprint by 4 PB; 20X performance boost
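Slide 3 mentions “schema on read” agility without showing what it means in practice. The sketch below is one minimal way to illustrate the idea in plain Python, not Cloudera's implementation: raw records are stored untouched, and each reader applies its own schema when it queries. The sample events, the field names and the read_with_schema helper are all hypothetical.

```python
import json

# Raw events stored exactly as they arrived (full fidelity).
# No schema is enforced at write time. All sample data is made up.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "channel": "web"}',
    '{"user": "b", "amount": "7", "device": "mobile"}',
    'not valid json',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select the requested fields and cast them.

    `schema` maps a field name to a cast function. Fields missing from a
    record come back as None; unparseable lines are skipped but stay in storage.
    """
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


# Two different "views" over the same raw data, defined only when reading.
spend_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
channel_view = list(read_with_schema(RAW_EVENTS, {"user": str, "channel": str}))

print(spend_view)
print(channel_view)
```

Because the schema lives with the reader rather than with the stored data, a new business question only needs a new schema dictionary, not a reload of the data.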

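Slide 6 lists infrequently accessed tables and large data volumes as what to migrate to an active archive. As a rough sketch of how such candidates might be shortlisted, the Python below filters table statistics by read count and size. The TableStats fields, the thresholds and the sample tables are assumptions for illustration; real figures would come from the warehouse's own usage logs.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical per-table usage statistics."""
    name: str
    size_gb: float
    reads_last_90_days: int


def archive_candidates(tables, max_reads=5, min_size_gb=100.0):
    """Return large, rarely read tables, largest first.

    Both thresholds are illustrative defaults, not recommendations.
    """
    picked = [
        t for t in tables
        if t.reads_last_90_days <= max_reads and t.size_gb >= min_size_gb
    ]
    return sorted(picked, key=lambda t: t.size_gb, reverse=True)


tables = [
    TableStats("orders_2009", 800.0, 1),
    TableStats("orders_current", 1200.0, 4500),
    TableStats("clickstream_2011", 2500.0, 0),
    TableStats("ref_countries", 0.1, 2),
]

candidates = archive_candidates(tables)
print([t.name for t in candidates])  # ['clickstream_2011', 'orders_2009']
```

Sorting by size puts the tables that free the most warehouse storage at the top of the list.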