The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.
Performance Summit – Intro and Data
Warehouse Performance
Andrew Holdsworth, Tom Kyte, Graham Wood
Server Technologies, Oracle Corp
Agenda
• 09:00 - 10:30 session
• 10:30 - 10:45 break
• 10:45 - 12:30 session
• 12:30 - 13:30 lunch
• 13:30 - 15:00 session
• 15:00 - 15:15 break
• 15:15 - ~17:00 session / Q&A / wrapup
• http://asktom.oracle.com ->
– Files Tab ->
• realworld.zip
Andrew Holdsworth
Senior Director
Real World Performance
Server Technologies
Tom Kyte
• Been with Oracle since 1993
• User of Oracle since 1987
• The “Tom” behind AskTom in
Oracle Magazine
www.oracle.com/oramag
• Expert Oracle Database
Architecture
• Effective Oracle by Design
• Expert One-on-One Oracle
• Beginning Oracle
Graham Wood
Architect
Server Technologies
• Make the experts prove everything
• Statements that should raise your
eyebrows:
– It is my opinion...
– I claim...
– I think...
– I feel…
– I KNOW…
• Everything can (and should) be proven
• Things change, expect that
• It only takes a single counter case
• “It depends” or “Why” are the only
answers you need
“Question Authority.”
The Data
Warehouse
© 2009 Oracle Corporation – Proprietary and Confidential
Why 3 Screens
• Interpreting the windows
– Demonstration menus
– Monitoring the Database Machine
• Begin loading 1 Terabyte of data
– Create Tablespaces
– Create Schema Objects
– One-Off Load
– Gather Statistics
Data Loading
Is Data Loading Really the Problem?
• Where have Oracle Customers struggled with the
Performance of their Data Warehouse?
– Data Loading?
– Data Validation and Verification?
– ETL and Transformation?
• So what went Wrong?
Data Warehouse Death Spiral
• HW CPU Sizing 10X
– Sized like an OLTP System
• I/O Sizing 10X
– Sized by Space requirements
– Cannot use Parallel Query
• Using the incorrect Query Optimization
Techniques 10X
– Over-Indexed Database
– Data Loads and ETL running too slow
• System Overloaded to Make the CPU look Busy
– 100s of Concurrent Queries taking Hours to Execute
Some Basic Maths
• Index Driven Query retrieving 1,000,000 rows
– Assume the Index is cached and the data is not.
• 1,000,000 random I/Os @ 5ms per I/O
• This requires 5000 Seconds to Execute
• This is why queries may take over an hour
– How much data could you scan in 5000 Seconds with a fully
sized I/O system able to scan 28 GByte/Sec?
• Clearly For Oracle Data Warehouses the game is
changing
– New Design Techniques
– Time to Re-Train the DBAs
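The arithmetic on this slide can be checked in a few lines, using only the figures quoted above (5 ms per random I/O, a 28 GByte/s scan rate):

```python
# Back-of-envelope maths from the slide: index-driven access vs. full scan.

ROWS = 1_000_000          # rows retrieved via single-block random I/O
IO_TIME_S = 0.005         # 5 ms per random I/O (index cached, data not)
SCAN_RATE_GB_S = 28       # fully sized I/O system scan rate (GByte/s)

# Index-driven: one random I/O per row.
index_seconds = ROWS * IO_TIME_S
print(f"Index-driven elapsed time: {index_seconds:.0f} s "
      f"(~{index_seconds / 3600:.1f} hours)")

# How much data could a full scan have read in that same time?
scanned_gb = index_seconds * SCAN_RATE_GB_S
print(f"Data scannable in that time: {scanned_gb:,.0f} GByte "
      f"(~{scanned_gb / 1024:.0f} TByte)")
```

The 5000-second (nearly 1.5 hour) index-driven plan could instead have scanned well over 100 TByte, which is the whole point of the slide.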
So, joe (or josephine) sql coder needs to run the following query:
select t1.object_name, t2.object_name
from t t1, t t2
where t1.object_id = t2.object_id
and t1.owner = 'WMSYS'
Rows Row Source Operation
------- ---------------------------------------------------
528384 HASH JOIN
8256 TABLE ACCESS FULL T
1833856 TABLE ACCESS FULL T
suppose they ran it or explain planned it -- and saw that plan.
"Stupid stupid CBO" they say -- "I have indexes, why won't it use
them. We all know that indexes mean fast=true! Ok, let me use the
faithful RBO and see what happens"
Mythology – why isn’t it using my index
select /*+ RULE */ t1.object_name, t2.object_name
from t t1, t t2
where t1.object_id = t2.object_id
and t1.owner = 'WMSYS'
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=HINT: RULE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T'
2 1 NESTED LOOPS
3 2 TABLE ACCESS (FULL) OF 'T'
4 2 INDEX (RANGE SCAN) OF 'T_IDX' (NON-UNIQUE)
See, now that’s what I’m talking about – indexes are good…
Or are they?
Mythology – why isn’t it using my index
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 35227 5.63 9.32 23380 59350 0 528384
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 35229 5.63 9.33 23380 59350 0 528384
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 80
Rows Row Source Operation
------- ---------------------------------------------------
528384 HASH JOIN
8256 TABLE ACCESS FULL T
1833856 TABLE ACCESS FULL T
Mythology – why isn’t it using my index
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 35227 912.07 3440.70 1154555 121367981 0 528384
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 35229 912.07 3440.70 1154555 121367981 0 528384
Misses in library cache during parse: 0
Optimizer goal: RULE
Parsing user id: 80
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=HINT: RULE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T'
2 1 NESTED LOOPS
3 2 TABLE ACCESS (FULL) OF 'T'
4 2 INDEX (RANGE SCAN) OF 'T_IDX' (NON-UNIQUE)
Mythology – why isn’t it using my index
1 SELECT phy.value,
2 cur.value,
3 con.value,
4 1-((phy.value)/((cur.value)+(con.value))) "Cache hit ratio"
5 FROM v$sysstat cur, v$sysstat con, v$sysstat phy
6 WHERE cur.name='db block gets'
7 AND con.name='consistent gets'
8* AND phy.name='physical reads'
VALUE VALUE VALUE Cache hit ratio
-------- ---------- ---------- ---------------
1277377 58486 121661490 .989505609
98.9% cache hit, not bad eh?
Mythology – why isn’t it using my index
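The ratio above can be reproduced directly from the three v$sysstat values on the slide, which is exactly why it is such a poor health indicator: the RBO plan burned over 120M logical reads and 912 CPU seconds, yet the ratio looks excellent:

```python
# Buffer cache hit ratio computed from the slide's v$sysstat values.
physical_reads = 1_277_377
db_block_gets = 58_486
consistent_gets = 121_661_490

hit_ratio = 1 - physical_reads / (db_block_gets + consistent_gets)
print(f"Cache hit ratio: {hit_ratio:.9f}")  # ~0.9895 -- "not bad eh?"
```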
Oracle Retail Data Warehouse Schema
Retail Demonstration
Table Sizes
TABLE          SIZE OF SOURCE DATA   NUMBER OF RECORDS
Transactions   52 GByte              461M
Payments       54 GByte              461M
Line Items     936 GByte             6945M
Total          1042 GByte            7867M
Retail Demonstration
Table Sizes (Default Compression)
TABLE          SIZE OF TABLE   COMPRESSION RATIO
Transactions   30 GByte        1.77 : 1
Payments       30 GByte        1.84 : 1
Line Items     268 GByte       3.55 : 1
Total          327 GByte       3.23 : 1
Retail Demonstration
Table Sizes (With HCC)
TABLE          SIZE OF TABLE   COMPRESSION RATIO
Transactions   5 GByte         7.00 : 1
Payments       5 GByte         7.60 : 1
Line Items     54 GByte        12.85 : 1
Total          64 GByte        11.98 : 1
NOTE: Compression ratios are relative to the uncompressed source data
Data Loading
Bulk Loading Challenges
• Problem: Moving data to the database host machine
– For high load rates the data staging machine and network
become the serialization point/bottleneck
– Increased network and staging area I/O bandwidth is an
expensive option
• Solution: Compress the source data files
– Compression reduces the number of bytes copied from disk
and over the network
Data Loading
Tip #1 Consider the Data Transfer Rate
• What would it take to load 1 TByte in one Hour?
– 17 GByte/minute or 291 MByte/second
• This is higher than the specification of most networks
and any portable drive
• So compression of source data becomes crucial
– 1057 GByte -> 136 GByte (7.7x compression)
– 2.3 GByte/minute or 40 MByte/s
• This eliminates the first challenge to migrating data
• Extraction of the data from legacy systems often
takes much longer than this!
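The rates quoted above follow from simple division (figures are rounded as on the slide; the compressed rate assumes the 136 GByte payload after 7.7x compression):

```python
# Data transfer rates needed to move ~1 TByte in one hour.
MB_PER_GB = 1024

uncompressed_gb = 1024    # ~1 TByte
compressed_gb = 136       # after ~7.7x compression (1057 GByte -> 136 GByte)

def rates(gb):
    """Return (GByte/minute, MByte/second) to move `gb` in one hour."""
    return gb / 60, gb * MB_PER_GB / 3600

print("Uncompressed: %.1f GByte/min, %.0f MByte/s" % rates(uncompressed_gb))
# ~17 GByte/min, ~291 MByte/s: beyond most networks and any portable drive
print("Compressed:   %.1f GByte/min, %.0f MByte/s" % rates(compressed_gb))
# ~2.3 GByte/min, ~39 MByte/s: easily achievable
```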
Data Staging
Data Sources
SOURCE                THROUGHPUT
USB Drive             20 MByte/s
Local Disk            30-40 MByte/s
Scalable NFS Server   Potentially at network speeds
DBFS                  Fastest (assuming data has been copied!)
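To see why the staging source matters, here is the time to copy the demo's 1042 GByte of uncompressed source data at each quoted throughput. This is a sketch, not a benchmark; the Local Disk midpoint and the 100 MByte/s network figure are assumptions, not slide numbers:

```python
# Time to copy 1042 GByte of source data at various staging throughputs.
SOURCE_GB = 1042

sources_mb_s = {
    "USB Drive": 20,            # from the slide
    "Local Disk": 35,           # assumed midpoint of the 30-40 MByte/s range
    "Network-speed NFS": 100,   # assumed ~GbE wire speed
}

for name, mb_s in sources_mb_s.items():
    hours = SOURCE_GB * 1024 / mb_s / 3600
    print(f"{name:18s} {hours:5.1f} hours")
```

At USB speed the copy alone takes the better part of a day, which is why compressing the source files first is the real fix.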
Data Loading
Bulk Loading Challenges
• Problem: Data loading is CPU/Memory Constrained
– Data loads scale well over multiple CPUs, cores and hosts
(assuming no other form of contention)
– Memory usage for metadata associated with highly
partitioned objects can become significant at high DOP
• Solution: Use the correct tools and plan accordingly
– Use external tables with a parallel SQL statement (e.g. CTAS
or IAS) to minimize on-disk and in-memory metadata. Do
NOT use multiple copies of SQL*Loader
– Data types for columns have a huge impact on the CPU
required to load the data. Raw is the cheapest and
Timestamp is the most expensive.
© 2009 Oracle Corporation – Proprietary and Confidential
Data Loading
Anatomy of an External Table
create table FAST_LOAD
(
  column definition list ...
)
organization external
( type oracle_loader
  default directory SPEEDY_FILESYSTEM
  preprocessor exec_file_dir:'zcat.sh'
  characterset 'ZHS16GBK'
  badfile ERROR_DUMP:'FAST_LOAD.bad'
  logfile ERROR_DUMP:'FAST_LOAD.log'
  (
    file column mapping list ...
  )
  location
  ('file_1.gz', 'file_2.gz', 'file_3.gz', 'file_4.gz')
)
reject limit 1000
parallel 4
/
Callouts:
• External table definition
• Default directory references the mount point
• Preprocessor uncompresses the data using a secure wrapper
• The characterset must match the characterset of the files
• Location lists the compressed files
• Parallel should match or be less than the number of files
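Putting the pieces together, the load itself is then typically a single parallel direct-path SQL statement against the external table. A sketch, in the spirit of the slide; the target table SALES_FACT and the DOP are placeholders, not from the demo:

```sql
-- Hypothetical parallel direct-path load from the external table above.
-- One SQL statement drives all the parallel slaves; no fleet of
-- SQL*Loader processes is involved.
alter session enable parallel dml;

insert /*+ APPEND PARALLEL(s, 4) */ into sales_fact s
select /*+ PARALLEL(f, 4) */ *
from   fast_load f;

commit;
```

A CTAS (`create table ... parallel as select ...`) works the same way for the initial one-off load.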
Loading Data
Tip #2 Learn About the Impact of Compression
• Compression incurs costs when loading
– Increased CPU time
– Increased elapsed time
• Compression provides benefits
– For scans
– For backup and recovery
• Write-once and Read-many means that compression
is a net benefit, not a cost
Loading Data
Tip #3 Learn About the Impact of Partitioning
• Partitioning incurs costs when loading
– Increased CPU time
– Increased elapsed time
• Partitioning provides benefits
– For queries
– For manageability
• Write-once and Read-many means that partitioning is
a net benefit, not a cost
Gathering Statistics
Strategy For New Databases
• Create tables
• Optionally Run (or explain) queries on empty tables
– Prime / Seed the optimizer
• Enable incremental statistics
– For large partitioned tables
• Load data
• Gather statistics
– Use the defaults
• Create indexes (if required!)
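The "enable incremental statistics" step above maps onto DBMS_STATS table preferences along these lines (a sketch; SALES is a placeholder for a large partitioned table):

```sql
-- Enable incremental (synopsis-based) global statistics for a
-- partitioned table, then gather with the defaults.
begin
  dbms_stats.set_table_prefs(
    ownname => user,
    tabname => 'SALES',
    pname   => 'INCREMENTAL',
    pvalue  => 'TRUE');

  -- With the defaults, only changed partitions are re-scanned and the
  -- global statistics are derived from the per-partition synopses.
  dbms_stats.gather_table_stats(user, 'SALES');
end;
/
```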
Gathering Statistics
Incremental Statistics
• One of the biggest problems with large tables is
keeping the schema statistics up to date and accurate
• This is particularly challenging in a Data Warehouse
where tables continue to grow and so the statistics
gathering time and resources grow proportionately
• To address this problem, 11.1 introduced the concept
of incremental statistics for partitioned objects
• This means that statistics are gathered for recently
modified partitions
Gathering Statistics
The Concept of Synopses
• It is not possible to simply add partition statistics
together to create an up to date set of global statistics
• This is because the Number of Distinct Values (NDV)
for a partition may include values common to multiple
partitions.
• To resolve this problem, compressed representations
of the distinct values of each column are created in a
structure in the SYSAUX tablespace known as a
synopsis
Gathering Statistics
Synopsis Example
Object            Column Values   NDV
Partition #1      1,1,3,4,5       4
Partition #2      1,2,3,4,5       5
NDV by addition   WRONG           9
NDV by synopsis   CORRECT         5
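The table's point can be reproduced with plain sets. A synopsis is, conceptually, a compressed set of the distinct values in each partition, so the global NDV is the size of the union of the sets, not the sum of the per-partition counts:

```python
# NDV by addition vs. NDV by union (what synopses make possible).
partition_1 = [1, 1, 3, 4, 5]
partition_2 = [1, 2, 3, 4, 5]

ndv_1 = len(set(partition_1))                            # 4
ndv_2 = len(set(partition_2))                            # 5

ndv_by_addition = ndv_1 + ndv_2                          # 9 -- WRONG
ndv_by_union = len(set(partition_1) | set(partition_2))  # 5 -- CORRECT

print(ndv_by_addition, ndv_by_union)
```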
Using Services To Manage Resources
• Services can be used to isolate different workloads
[Diagram: Load and Query workloads isolated via separate services]
Resource Management
• The Debate
– Why would you do it ?
– How should you do it ?
– Where are the sweet spots ?
Validation Example
Set based processing vs. row by row
[Chart: elapsed time (H:MI:SS) for set-based vs. row-by-row processing]
Validation and Transformation
Proof Points
• For the two validation processes we can now make
these claims
– Store Validation - Over 200 times faster
– Product Validation - Over 2500 times faster
• Same Hardware!
– This is a case of using the wrong methodology
Ad Hoc Query
Question
“What were the most popular items in the baskets of
shoppers who visited stores in California in the first
week of May and didn't buy bananas?”
Ad Hoc Query
SQL
with qbuy as
( select rt.TRX_NBR
from DWR_ORG_BSNS_UNIT obu, DWB_RTL_TRX rt, DWB_RTL_SLS_RETRN_LINE_ITEM rsrli, DWR_SKU_ITEM sku
where obu.ORG_BSNS_UNIT_KEY = rt.BSNS_UNIT_KEY
and rt.TRX_NBR = rsrli.TRX_NBR
and rt.DAY_KEY = rsrli.DAY_KEY
and rsrli.SKU_ITEM_KEY = sku.SKU_ITEM_KEY
and rt.DAY_KEY between 20090501 and 20090507
and obu.STATE in ('CA')
and sku.SKU_ITEM_DESC = 'Bananas'),
qall as
( select rt.TRX_NBR
from DWR_ORG_BSNS_UNIT obu, DWB_RTL_TRX rt
where obu.ORG_BSNS_UNIT_KEY = rt.BSNS_UNIT_KEY
and rt.DAY_KEY between 20090501 and 20090507
and obu.STATE in ('CA'))
select sku.SKU_ITEM_DESC,q.SCANS
from
( select SKU_ITEM_KEY,count(*) as SCANS,rank() over (order by count(*) desc) as POP
from qall,qbuy, DWB_RTL_SLS_RETRN_LINE_ITEM rsrli
where qall.TRX_NBR = qbuy.TRX_NBR(+)
and qbuy.TRX_NBR IS NULL
and rsrli.TRX_NBR = qall.TRX_NBR
and rsrli.DAY_KEY between 20090501 and 20090507
group by SKU_ITEM_KEY) q, DWR_SKU_ITEM sku
where q.SKU_ITEM_KEY = sku.SKU_ITEM_KEY
order by q.POP asc;
Callouts:
• qbuy: 4-table join to select all transactions buying Bananas in California in the first week of May
• qall: 2-table join to select all transactions in California in the first week of May
• Join the result sets in an outer join to find the exclusions, then rank, group and sort the results
Concurrent Query Testing
Out of the Box Settings (secs)
[Bar chart: per-user query elapsed times, 0-120 second scale, Users #1-#12]
Concurrent Query Testing
DBA Restricting DOP (secs)
[Bar chart: per-user query elapsed times, 0-120 second scale, Users #1-#12]
Concurrent Query Testing
Query Queuing (secs)
[Bar chart: per-user query elapsed times, 0-120 second scale, Users #1-#12]
Concurrent Query Testing
User      Out of the Box   Fixed DoP   With Queuing
User 1    30               24          10
User 2    33               27          11
User 3    34               27          11
User 4    39               29          11
User 5    40               29          15
User 6    41               29          19
User 7    43               29          21
User 8    47               28          16
User 9    49               30          25
User 10   106              28          27
User 11   108              26          25
User 12   112              27          26
Average   57               28          18
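The averages in the bottom row follow from the per-user timings (rounded to whole seconds, as on the slide), and the worst-case user tells the real story about queuing:

```python
# Per-user elapsed times (seconds) from the concurrent query test.
out_of_box = [30, 33, 34, 39, 40, 41, 43, 47, 49, 106, 108, 112]
fixed_dop  = [24, 27, 27, 29, 29, 29, 29, 28, 30, 28, 26, 27]
queuing    = [10, 11, 11, 11, 15, 19, 21, 16, 25, 27, 25, 26]

for name, times in [("Out of the Box", out_of_box),
                    ("Fixed DoP", fixed_dop),
                    ("With Queuing", queuing)]:
    avg = sum(times) / len(times)
    print(f"{name:15s} worst={max(times):3d}s  avg={avg:.0f}s")
```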
1 Terabyte Loaded and Ready To Go In 20 Minutes
Operation Time
Create Tablespaces and run DDL 0:39
Initial 1TB Load 9:55
Gather Statistics 3:36
Daily Incremental Load 1:44
Referential Integrity Check 0:51
Transform Data 1:09
Exchange and Incremental Statistics 0:22
Query from Hell 0:32
Total 18:48
The preceding is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.
Real World Performance - Data Warehouses