Designing High Performance Datawarehouse

Just when the world of "Data 1.0" showed some signs of maturing, "Outside In" demands seem to have already initiated some of the disruptive changes to the data landscape. Parallel growth in the volume, velocity and variety of data, coupled with the incessant drive to find newer insights and value in data, has posed a big question: Is Your Data Warehouse Relevant?

In short, the surrounding changes happening in real time are the new "Data 2.0". It is characterized by feeding ever-hungry minds with sharper insights, whether related to regulation, finance, corporate actions, risk management, or purely aimed at improving operational efficiencies. The sources in this new "Data 2.0" world have to be commensurate with the outside-in demands from customers, regulators, stakeholders and business users; hence, you need a high "relformance" (relevance + performance) data warehouse that is relevant to your business ecosystem and has the power to scale exponentially.

This webinar starts by giving the audience a sneak preview of what happened in the Data 1.0 world and which characteristics are shaping the new Data 2.0 world. It then delves into the challenges that growing data volumes have posed to data warehouse teams, and presents some practical, proven methodologies to address these performance challenges. Finally, it highlights some thought-provoking ways to turbocharge your data warehouse initiatives by leveraging newer technologies such as Hadoop. Overall, the webinar shows how to build a high performance, relevant data warehouse that can meet newer demands while significantly driving down total cost of ownership.


Transcript

  • 1. Welcome to the webinar on Designing High Performance Datawarehouse
  • 2. Contents: 1. What happened in the Data 1.0 World; 2. What is shaping the new Data 2.0 World; 3. Designing High Performance Datawarehouse; 4. Q&A
  • 3. What happened in the Data 1.0 World?
    - Before 2000: Do we need a DWH? Data silos; painful implementations; metrics for success?
    - 2000s: Select successes, top down and bottom up; advent of the ODS; OLAP = insights; "show me the ROI"; drill-down reporting from the DWH getting into the mainstream; standardized KPIs; analytics as a differentiator?
    - Now: Business led; "we've got BI/DWH tools"; Volume | Variety | Velocity | Value; performance vs. volume as the game changer; need insights from non-structured data as well; analytics is a differentiator; big, real time, in-memory data: what to do with existing initiatives? Retaining skills and expertise.
    - Data 2.0: scale, performance, knowledge, relevance.
  • 4. Challenges in the current DW environment (TDWI research based on 278 respondents, top responses): 42% say they can't scale to big data volumes; 27% cite inadequate data load speed; 27% cite poor query response; 25% say the existing DW is modeled for reports and OLAP only. The remaining responses (between 24% and 9% each) include: can't score analytic models fast enough; cost of scaling up or out is too expensive; can't support a high concurrent user count; inadequate support for in-memory processing; current platform needs great manual effort for performance; poorly suited to real-time workloads; can't support in-database analytics; poor CPU speed and capacity; and the current platform is a legacy that must be phased out.
  • 5. The Data 2.0 world feeds the high performance data warehouse with social media data, text data, sensor data, syndicated data and numeric data; the warehouse must be concurrency enabled, able to handle complexity, and able to scale with speed, delivering true sentiment, faster compliance and faster reach. Every 18 months, non-rich structured and unstructured enterprise data doubles. Big data analytics: analytics = competitive advantage; efficiencies driving down costs; customer experience and service. Business is now equipped to consume, identify and act upon this data for superior insights.
  • 6. So what is a High Performance Datawarehouse? Key Dimensions
  • 7. The four key dimensions of a high performance data warehouse: Speed, Concurrency, Scale and Complexity.
  • 8. The four dimensions in detail:
    - Speed: streaming big data; event processing; real time operation; operational BI; near time analytics; dashboard refresh; fast queries.
    - Concurrency: competing workloads (OLAP, analytics); intraday data loads; thousands of users; ad hoc queries.
    - Scale: big data volumes; detailed source data; thousands of reports; scale out into cloud, clusters, grids, etc.
    - Complexity: big data variety (unstructured, sensor, social media); many sources and targets; complex models and SQL; high availability.
  • 9. Designing High Performance Datawarehouse
  • 10. Industry recognized top techniques (TDWI research based on 329 responses from 114 respondents): 45% say creating summary tables; 44% say adding indexes; 33% say altering SQL statements or routines. The remaining techniques (between 24% and 6% each) include: changing physical data models; using in-memory databases; upgrading hardware; choosing between column- and row-oriented data storage; restricting or throttling user queries; moving an application to a separate data mart; applying workload management controls; shifting some workloads to off-peak hours; adjusting system parameters; and others.
  • 11. Designing Summary Tables 45% say Creating Summary Tables
  • 12. Summary table design process:
    - COLLECT a good sampling of queries. These may come from user interviews, testing/QA queries, production queries, reports or any other means that provides a good representation of expected production queries.
    - ANALYZE and IDENTIFY the dimension hierarchy levels, dimension attributes and fact table measures required by each query or report, along with the row counts associated with each dimension level represented.
    - BALANCE the most commonly required dimension levels against the number of rows in the resulting summary tables. A goal should be to design summary tables that are roughly 1/100th the size of the source fact tables in terms of rows, or less.
    - MINIMIZE the columns carried in the summary table in favor of joining back to the dimension table; the larger the summary table, the less performance advantage it provides. Some of the best candidates for aggregation are those where row counts decrease the most from one level in a hierarchy to the next. A minimal SQL sketch of such an aggregate follows.
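To make this concrete, the following is a minimal sketch of a summary table build. The star schema names (sales_fact, store_dim, date_dim) are assumptions modeled on the levels and measures shown on the next slide, and CREATE TABLE AS syntax varies by DBMS.

```sql
-- Hypothetical star schema: aggregate store/day sales up to district/month,
-- keeping only additive measures.
CREATE TABLE sales_by_district_month AS
SELECT
    s.district,                        -- aggregate dimension level
    d.calendar_year,
    d.calendar_month,
    SUM(f.sales_qty) AS sales_qty,     -- additive measures only
    SUM(f.sale_amt)  AS sale_amt
FROM sales_fact f
JOIN store_dim s ON s.store_key = f.store_key
JOIN date_dim  d ON d.date_key  = f.date_key
GROUP BY s.district, d.calendar_year, d.calendar_month;
```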
  • 13. Capturing requirements for the summary table. Choosing aggregates to create: two basic pieces of information are required to select the appropriate aggregates: the expected usage patterns of the data, and the data volumes and distributions in the fact table. The slide tabulates Reports 1 through 11 against the dimension levels each requires (for example District, Store, Category, Department, Calendar Year, Calendar Month, Fiscal Quarter, Fiscal Period, Fiscal Week) and the measures each returns (Sales_Qty, Sale_Amt), alongside the hierarchies involved (Geography: Division > Region > District > Store; Product: Subject > Category > Department > Item; Date: Calendar Year > Calendar Month, and Fiscal Year > Fiscal Quarter > Fiscal Period > Fiscal Week) and the number of populated members at each dimension level (1, 3, 50, 3,980, 279, 1,987, 4,145, 3, 12, 36 and 156). The profiling sketch below shows how such member counts can be gathered.
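As a hedged illustration of profiling data volumes and distributions, a query like the following (against the assumed store_dim table) yields the member counts per hierarchy level that drive the choice of aggregates:

```sql
-- Member counts per geography level; levels with the largest drop in counts
-- between adjacent levels are the best aggregation candidates.
SELECT
    COUNT(DISTINCT division) AS divisions,
    COUNT(DISTINCT region)   AS regions,
    COUNT(DISTINCT district) AS districts,
    COUNT(DISTINCT store_id) AS stores
FROM store_dim;
```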
  • 14. Summary table design considerations:
    - Aggregate storage column selection: semi-additive and non-additive fact data need not be stored in the summary table; add as many "pre-calculated" columns as possible; "count" columns can be added for non-additive facts to preserve a portion of the information.
    - Recreating vs. updating aggregates: it is efficient for aggregation programs to update the aggregate tables with the newly loaded data; regeneration is more appropriate if there is a lot of program logic to determine what data must be updated in the aggregate table. A sketch of the update approach follows this list.
    - Storing aggregate rows: a combined table containing base-level fact rows and aggregate rows; a single aggregate table that holds all aggregate data for a single base fact table; or a separate table for each aggregate created, which is the most preferred option.
    - Storing aggregate dimension data: multiple hierarchies in a single dimension; store all of the aggregate dimension records together in a single table; use a separate table for each level in the dimension; or add dimension data to the aggregate fact table.
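A minimal sketch of the update approach, reusing the assumed names from the earlier summary table sketch; a real job would merge or rebuild the affected periods rather than blindly inserting:

```sql
-- Fold only the newly loaded fact rows into the aggregate.
INSERT INTO sales_by_district_month
    (district, calendar_year, calendar_month, sales_qty, sale_amt)
SELECT s.district, d.calendar_year, d.calendar_month,
       SUM(f.sales_qty), SUM(f.sale_amt)
FROM sales_fact f
JOIN store_dim s ON s.store_key = f.store_key
JOIN date_dim  d ON d.date_key  = f.date_key
WHERE f.load_date = CURRENT_DATE       -- assumed audit column on the fact table
GROUP BY s.district, d.calendar_year, d.calendar_month;
```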
  • 15. Efficient Indexing for Datawarehouse 44% say Adding Indexes
  • 16. Dimension table indexing:
    - Create a non-clustered primary key on the surrogate key of each dimension table.
    - A clustered index on the business key should be considered: it enhances query response when the business key is used in the WHERE clause, and helps avoid lock escalation during the ETL process.
    - For large type 2 SCDs, create a four-part non-clustered index: business key, record begin date, record end date and surrogate key.
    - Create non-clustered indexes on columns in the dimension that will be used for searching, sorting or grouping.
    - If there is a hierarchy in a dimension, such as Category > Subcategory > Product ID, create an index on the hierarchy.
    Example index columns and types from the slide: EmployeeKey (non-clustered); EmployeeNationalIDAlternateKey (clustered); EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey (non-clustered); FirstName, LastName, DepartmentName (non-clustered). The sketch below renders these as DDL.
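Rendered as DDL, assuming SQL Server style syntax (the slide's clustered/non-clustered terminology suggests it) and a hypothetical DimEmployee table:

```sql
-- Non-clustered primary key on the surrogate key.
ALTER TABLE DimEmployee
    ADD CONSTRAINT PK_DimEmployee PRIMARY KEY NONCLUSTERED (EmployeeKey);

-- Clustered index on the business key.
CREATE CLUSTERED INDEX IX_DimEmployee_BusinessKey
    ON DimEmployee (EmployeeNationalIDAlternateKey);

-- Four-part index for type 2 SCD lookups during ETL.
CREATE NONCLUSTERED INDEX IX_DimEmployee_SCD2
    ON DimEmployee (EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey);

-- Search/sort/group columns.
CREATE NONCLUSTERED INDEX IX_DimEmployee_Search
    ON DimEmployee (FirstName, LastName, DepartmentName);
```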
  • 17. Fact table indexing:
    - Create a clustered, composite index composed of each of the foreign keys in the fact table, e.g. OrderDateKey, ProductKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, DueDateKey.
    - Keep the most commonly queried date column as the leftmost column in the index. There can be more than one date in the fact table, but there is usually one date that is of the most interest to business users; a clustered index leading with this column quickly segments the amount of data that must be evaluated for a given query. A DDL sketch follows.
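The same guidance as a hedged DDL sketch, assuming a hypothetical FactSales table; note that the most commonly queried date key leads the index:

```sql
CREATE CLUSTERED INDEX IX_FactSales_Composite
    ON FactSales (OrderDateKey, ProductKey, CustomerKey, PromotionKey,
                  CurrencyKey, SalesTerritoryKey, DueDateKey);
```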
  • 18. Column-Oriented Databases
  • 19. Row store vs. column store. Most queries do not process all the attributes of a particular relation. Row store: (+) easy to add or modify a record; (-) might read in unnecessary data. Column store: (+) only needs to read in relevant data; (-) tuple writes require multiple accesses. One can obtain the performance benefits of a column store using a row store by making some changes to the physical structure of the row store: vertical partitioning, index-only plans, or materialized views (a sketch of the last follows; the next two slides cover the first two).
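A minimal materialized view sketch, assuming PostgreSQL or Oracle style syntax and the hypothetical sales_fact table from earlier:

```sql
-- Precomputed per-date totals served from the view instead of the base table.
CREATE MATERIALIZED VIEW mv_sales_by_date AS
SELECT date_key, SUM(sale_amt) AS sale_amt, COUNT(*) AS row_cnt
FROM sales_fact
GROUP BY date_key;
```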
  • 20. Vertical partitioning. Process: fully vertically partition each relation, so that each column becomes one physical table. This is achieved by adding an integer position column to every table (adding an integer position is better than reusing the primary key); multi-column fetches then join on position, as sketched below.
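A hedged sketch of full vertical partitioning with assumed names; each column of the fact table becomes a two-column physical table keyed by an integer position:

```sql
CREATE TABLE sf_sale_amt  (pos INTEGER, sale_amt  DECIMAL(12,2));
CREATE TABLE sf_sales_qty (pos INTEGER, sales_qty INTEGER);

-- A multi-column fetch joins the partitions back together on position.
SELECT a.sale_amt, q.sales_qty
FROM sf_sale_amt a
JOIN sf_sales_qty q ON q.pos = a.pos;
```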
  • 21. Index-only plans. Process: add a B+Tree index for every table.column; plans never access the actual tuples on disk, and since headers are not stored, per-tuple overhead is less. A sketch follows.
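A hedged sketch with assumed names; when an index covers every column a query touches, many optimizers can answer it from index pages alone:

```sql
CREATE INDEX ix_sales_date_amt ON sales_fact (date_key, sale_amt);

-- Touches only indexed columns, so it is eligible for an index-only scan.
SELECT date_key, SUM(sale_amt)
FROM sales_fact
GROUP BY date_key;
```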
  • 22. Using Hadoop for Datawarehouse
  • 23. The Hadoop ecosystem: an ecosystem of open source projects hosted by the Apache Foundation, built on concepts Google developed and shared.
    - Distributed storage: HDFS, a distributed file system with the ability to scale out.
    - Distributed processing: MapReduce.
    - Non-relational database: HBase.
    - Query and scripting: Pig.
    - Metadata management: HCatalog.
    - Data extraction and loading: HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop.
    - Workflow and scheduling: Oozie.
    - Management and monitoring: Ambari, ZooKeeper.
  • 24. Promising uses of Hadoop in a DW context:
    - Data staging: Hadoop allows organizations to deploy an extremely scalable and economical ETL environment.
    - Data archiving: Hadoop's scalability and low cost enable organizations to keep all data forever in a readily accessible online environment.
    - Schema flexibility: Hadoop enables the growing practice of "late binding": instead of transforming data as it is ingested, structure is applied at runtime (sketched after this list).
    - Processing flexibility: Hadoop can quickly and easily ingest any data format.
    - Distributed DW architecture: offload workloads for big data and advanced analytics to HDFS, discovery platforms and MapReduce.
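A hedged HiveQL sketch of late binding; paths and names are assumptions. Raw files land in HDFS first, and a schema is applied only when the data is read:

```sql
CREATE EXTERNAL TABLE raw_clickstream (
    event_time STRING,
    user_id    STRING,
    url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/staging/clickstream';

-- Structure is applied at query time, not at ingest.
SELECT url, COUNT(*) AS hits
FROM raw_clickstream
GROUP BY url;
```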
  • 25. What led to the data warehouse at Facebook.
    - The problem: data, data and more data; 200 GB per day in March 2008, growing to 2+ TB (compressed) per day; the need to publish data in well known schemas.
    - The Hadoop experiment: superior availability, scalability and manageability compared to commercial databases; uses the Hadoop File System (HDFS).
    - Challenges with Hadoop: programmability and metadata; MapReduce is hard to program.
    - The answer: HIVE. What is Hive? A system for managing and querying structured data built on top of Hadoop; it uses MapReduce for execution and HDFS for storage. Key building principles: SQL on structured data as a familiar data warehousing tool; pluggable map/reduce scripts in the language of your choice; rich data types; performance. Tables: each table has a corresponding directory in HDFS and points to existing data directories in HDFS; data is split based on the hash of a column, mainly for parallelism (sketched below).
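A hedged HiveQL sketch of the table layout the slide describes; table and column names are assumptions. Each table maps to an HDFS directory, and data is bucketed by the hash of a column for parallelism:

```sql
CREATE TABLE page_views (
    view_time BIGINT,
    user_id   STRING,
    page_url  STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Familiar SQL that Hive compiles to MapReduce jobs over HDFS files.
SELECT page_url, COUNT(*) AS views
FROM page_views
WHERE dt = '2013-01-01'
GROUP BY page_url;
```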
  • 26. Analytical platforms
  • 27. Analytical platforms overview. Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general purpose solutions. Vendors include 1010data, Aster Data (Teradata), Calpont, DATAllegro (Microsoft), Exasol, Greenplum (EMC), IBM Smart Analytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, ParAccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata and Vertica (HP). Deployment options: software only (ParAccel, Vertica); appliance (SAP, Exadata, Netezza); hosted (1010data, Kognitio). Examples: Kelley Blue Book consolidates millions of auto transactions each week to calculate car valuations; AT&T Mobility tracks purchasing patterns for 80M customers daily to optimize targeted marketing.
  • 28. Which platform do you choose? The options form a spectrum across structured, semi-structured and unstructured data: general purpose RDBMS, analytic database, and Hadoop.
  • 29. Thank you. Please send your feedback, and any corporate training or consulting requirements on BI, to sameer@compulinkacademy.com