SlideShare a Scribd company logo
1 of 49
Luca Canali – CERN
Marcin Blaszczyk – CERN

UKOUG Conference, Birmingham, 3rd-5th December 2012
Outline
•
•
•
•
•

CERN and Oracle
Architecture
Use Cases for ADG@CERN
Our experience with ADG and
lessons learned
Conclusions
Outline
•
•
•
•
•

CERN and Oracle
Architecture
Use Cases for ADG@CERN
Our experience with ADG and
lessons learned
Conclusions
CERN
•
•
•
•
•

European Organization for Nuclear Research founded in 1954
20 Member States, 7 Observer States + UNESCO and UE
60 Non-member States collaborate with CERN
2400 staff members work at CERN as personnel, 10 000 more researchers from
institutes world-wide
5 Nobel Laureates

5
LHC and Experiments
•

Large Hadron Collider (LHC) –
particle accelerator used to
collide beams at very high energy
•
•
•

•

•

27 km long circular tunnel
Located ~100m underground
Protons currently travel at 99.9999972%
of the speed of light

Collisions are analysed with usage
of special detectors and software in
the experiments dedicated to LHC
New particle discovered!
consistent with the Higgs Boson

6
WLCG
•

The world‟s largest computing grid
More than 20 Petabytes
of data stored and analysed
every year
Over 68 000 physical CPUs
Over 305 000 logical CPUs
157 computer centres in 36
countries
More than 8000 physicists with
real-time access to LHC data
7
Oracle at CERN
•

Relational DBs play a key role in the LHC production chains
•

Accelerator logging and monitoring systems
Online acquisition, offline: data (re)processing, data
distribution, analysis
Grid infrastructure and operation services

•
•
•

•

Data management services
•

•

Monitoring, dashboards, etc.
File catalogues, file transfers, etc.

Metadata and transaction processing for tape storage system

8
CERN‟s Databases
•

~100 Oracle databases, most of them RAC
•
•

•

Examples of critical production DBs:
•
•

•

Mostly NAS storage plus some SAN with ASM
~300 TB of data files for production DBs in total

LHC logging database ~140 TB, expected growth up to ~70 TB / year
13 Production experiments‟ databases ~120 TB in total

15 Data Guard RAC clusters in Prod
•

Active Data Guards since upgrade to 11g

Redo Transport

10
Outline
•
•
•
•
•

CERN and Oracle
Architecture
Use Cases for ADG@CERN
Our experience with ADG and
lessons learned
Conclusions
Active Data Guard – the Basics

•

ADG enables read only access to the physical standby
12
Active Data Guard @CERN
•

Replication
•

•

•

Load Balancing
•
•

•

Inside CERN from online (data acquisition) to offline (analysis)
Replication to remote sites (being considered)
Offloading queries
Offloading backups

Features available also with Data Guard
•
•
•

Disaster recovery
Duplication of databases
Others
13
Architecture We Use
Maximum performance

Maximum performance

Active Data Guard
for users’ access
Redo Transport

Primary
Database

Active Data Guard
for users’ access and
for disaster recovery

Primary
Database

Active Data Guard
for disaster recovery

1. Low load ADG

2. Busy & critical ADG

LOG_ARCHIVE_DEST_X=‘SERVICE=<tns_alias> OPTIONAL
ASYNC NOAFFIRM VALID FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
DB_UNIQUE_NAME=<standby_db_unique_name>’
14
Deployment Model
•

Production RAC systems
•
•

•

Number of nodes: 2 - 5
NAS storage with SSD cache

(Active) Data Guard RAC
•
•
•

Number of nodes: 2 - 4
Sized for average production load
ASM on commodity HW, recycle „old production‟
15
Outline
•
•
•
•
•

CERN and Oracle
Architecture
Use Cases for ADG@CERN
Our experience with ADG and
lessons learned
Conclusions
Replication Use Cases
•

Number of replication use cases
•
•

•

Inside CERN
Across the Grid

So far handled by Oracle Streams
•
•

Logical replication with messages called LCRs (Logical Change Record)
Service started with 10gR1, now 11gR2

PROPAGATION

Redo Transport

17
ADG vs. Streams for Our Use Cases
Streams

Active Data Guard
•

Block level replication
•
•

•
•
•
•

•

•
•

More robust & Lower latency
Full database replica

Less maintenance effort 
More use cases than just
replication 
Complex replication cases
not covered 
Read-only replica 

SQL based replication

•
•
•
•

More flexible
Data selectivity

Replica is accessible for
read-write load 
Unsupported data types 
Throughput limitations 
Can be complicated to
recover 
18
Offloading Queries to ADG
•

Workload distribution
•
•
•

Transactional workload runs on primary
Read-only workload can be moved to ADG
Read-mostly workload
•

•

DMLs can be redirected to remote database with a dblink

Examples of workload on our ADGs:
•

Ad-hoc queries, analytics and long-running
reports, parallel queries, unpredictable workload and
test queries
19
Offloading Backups to ADG
•

Significantly reduces load on primary
•

•

Removes sequential I/O of full backup

ADG has great improvement for VLDBs
•

allows usage of block change tracking for fast
incremental backups

20
Disaster Recovery
•

We have been using it since a few years
•
•

•
•

Switchover/failover is our first line of defense
Saved the day already for production services

Disaster recovery site at ~10 km from our
datacenter
In the future remote site in Hungary

* Active Data Guard option not required
21
Duplicate for Upgrade
4
2
5
1
6
3
DATABASE downtime

RDBMS upgrade
Upgrade complete!
Clusterware 10g
+
RDBMS 10g

Redo Transport
Redo Transport

Clusterware 11g
+
RDBMS 10g
11g

22
Duplicate for Upgrade - Comments
•

Risk is limited 
•
•
•
•

•

Downtime is limited 
•

•

Fresh installation of the new clusterware
Old system stays untouched
Allows full upgrade test
Allows stress testing of new system
~ 1h for RDBMS upgrade

Additional hardware is required 
•

Only for the time of the upgrade

* Active Data Guard option not required
23
More Use Cases
•

Load testing
•

•

Recover objects against logical corruption
•

•

Human errors

Leverage flashback logs
•

•

Real application testing (RAT)

Additional writes may have the negative impact on
production database

Data lifecycle activities
24
Snapshot Standby
•

Facilitates opening standby in read-write mode
•

Number of steps is limited to two
SQL> ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;

•

•

Restore point created internally by Oracle

Primary database is always protected
•

Redo is being received when standby is opened
25
Data Lifecycle Activities
•

Transport big data volumes from busy production DBs
•

•

Transportable Tablespaces (TTS), Data Pump
Challenge: TTS requires read only tablespaces

Redo Transport

Transportable Tablespaces
Data Pump

SQL> ALTER TABLESPACE data_2011q1 READ ONLY;
data_2011q4
data_2011q2
data_2011q3
26
Custom Monitoring
•

Monitoring
•
•
•

•

Availability
Latency
+ Automatic MRP restart

Racmon

27
Custom Monitoring
•

Strmmon

28
Outline
•
•
•
•
•

CERN and Oracle
Architecture
Use Cases for ADG@CERN
Our experience with ADG and
lessons learned
Conclusions
Sharing Production Experience
•

ADG in production since Q1 2012
•

•

Oracle 11.2.0.3, OS: RHEL5 64 bit

In the following:
•
•

A few thoughts of how ADG is working in our
environment
Sharing experience from operations

30
ADG for Replication
•

More robust than Streams
•
•

•

Fewer incidents, less configuration, less manpower
Users like the low latency

More complex replication setups
•

Still on Streams

31
Offloading Backups to ADG
•

Problem: OLTP production slow during backup
•

Latency for random IO is critical to this system
• degraded by backup sequential IO

•

It‟s a storage issue:
• Midrange NAS storage

32
Number of waits

Details of Storage Issue on Primary

Very slow reads appear

Reads from SSD cache go to zero
33
Custom Backup Implementation
•

We have modified our backup implementation
and adapted it for ADG
•
•

It was worth the effort for us.
Many details to deal with.
• One example: archive log backup and deletion policies
• Google “Szymon Skorupinski CERN openworld 2012” for

details
34
Backups with RMAN
•

We find backups with RMAN still a very good idea
•

•

Automatic restore and recovery system
•

•

Data Guard is not a backup solution
Periodic checks that backups can be recovered

Block change tracking on ADG
•

•

Incremental backups only read changed blocks
Highly beneficial for backup performance
35
Redo Apply Throughput
•

ADG redo apply by MRP
•
•

Noticeable improvement in speed in 11g vs. 10g
We see up to ~100 MB/s of redo processed
• Much higher than typical redo generation in our DBs

•

Note on our configuration:
•
•

Standby logfiles, real time apply + force logging
Archiving to ADG using ASYNC NOAFFIRM
36
Redo Apply Latency
•

Latency to ADG normally no more than 1 sec
•

Apply latency spikes: ~20 sec, depends on workload
•

•

Latency builds up when ADG is creating big datafiles
•

•

•

Redo apply slaves wait for checkpoint completed (11g)

One redo apply slave creates file, rest of slaves wait (11g)

MRP (recovery) can stop with errors or crashes

Possible workarounds:
•

Minimize redo switch by using large redo log files
• Create datafiles with small initial size and then (auto)extend
37
Latency and Applications
•

Currently alarms to on-call DBAs
•

•

When latency primary-ADG greater than programmable

What if we had to guarantee latency
•

Applications maybe would need to be modified to make
them latency-aware
Useful techniques to know

•
•

ALTER SESSION SET STANDBY_MAX_DATA_DELAY=<n>;

•

ALTER SESSION SYNC WITH PRIMARY; – SYNC only!
38
Administration
•
•
•

In most cases ADG doesn‟t require much DBA
time after initial setup
Not much different than handling production DBs
Although we see bugs and issues specific to ADG
•

Two examples in following slides

39
Failing Queries on ADG
•

Sporadic ORA-1555 (snapshot too old) on ADG
•

Normal behaviour if queries need read consistent
images outside undo retention threshold
• This case is anomalous: short queries
• Under investigation, seems a bug
• Example:
ORA-01555 caused by SQL statement below (SQL
ID: …, Query Duration=1 sec, SCN: …)
40
Stuck Recovery and Lost Write
•

Issue: ORA-600 [3020] “Redo is inconsistent with data block”
•

•
•
•

Analysis:
•
•
•

•

Critical production with 2 ADGs.
Both ADGs stop applying redo at the same time... on a Friday night!
Primary keeps working

Possible cause: lost write on primary
Corruption in one index on primary - rebuilt online by DBA on-call
Is the root cause an HW issue or an Oracle bug?

Impact:
•

Redo apply on ADG is stuck while issue is being fixed by DBA
41
Some Thoughts on Lost Write Issue
•

Complex recovery case
•

•

How to be proactive?
•

Setting parameter DB_LOST_WRITE_PROTECT=TYPICAL
•
•

•

See also note: Resolving ORA-752 or ORA-600 [3020] During
Standby Recovery [ID 1265884.1]

It is a warning mechanism not a fix for lost write
Impact: just a few percent increase in redo log writes

Should I worry that my primary DB and ADG are not
fully synchronized?
•
•

Not easy to check
Note ADG and primary are not exact binary copies
42
Plans for the Future
•

Deploy ADG for replication to remote sites
•

Source at CERN, ADG at another grid site
• ADGs under administration by remote sites‟ DBAs

•

Challenges:
•
•
•

Manage heterogeneous environment
Redo transport over WAN
Maintain a partial standby
43
Features Under Investigation
•

Data Guard Broker
•

•
•
•
•

fast-start failover

Partial standby configuration
Cascading standby
Synchronous redo transport
ADG in 12c
44
Outline
•
•
•
•
•

CERN and Oracle
Architecture
Use Cases for ADG@CERN
Our experience with ADG and
lessons learned
Conclusions
Conclusions
•

Active Data Guard provides added value to our database services
•

•

In particular ADG has helped us
•
•

•

•

Although requires some additional effort in monitoring
Found a few bugs on the way

We find ADG mature
•

•

Strengthening our replication deployments
Offloading production workload, including backups

We find ADG has low maintenance overhead
•

•

Almost 1 year of production experience

Although we also look forward to additional improvements in next releases

We are planning to extend our usage of ADG
46
Acknowledgements
•

CERN Database Group
• Experiments Community at CERN
• Oracle contacts: Monica Marinucci, Greg
Doherty, Larry Carpenter for discussions

47
Luca.Canali@cern.ch
Marcin.Blaszczyk@cern.ch
Active Data Guard @CERN on UKOUG 2012

More Related Content

What's hot

Scalable HiveServer2 as a Service
Scalable HiveServer2 as a ServiceScalable HiveServer2 as a Service
Scalable HiveServer2 as a Service
DataWorks Summit
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Kristofferson A
 
The state of Spark in the cloud
The state of Spark in the cloudThe state of Spark in the cloud
The state of Spark in the cloud
Nicolas Poggi
 

What's hot (20)

Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Scalable HiveServer2 as a Service
Scalable HiveServer2 as a ServiceScalable HiveServer2 as a Service
Scalable HiveServer2 as a Service
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Red Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
Red Hat® Ceph Storage and Network Solutions for Software Defined InfrastructureRed Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
Red Hat® Ceph Storage and Network Solutions for Software Defined Infrastructure
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
 
Tuning Apache Ambari performance for Big Data at scale with 3000 agents
Tuning Apache Ambari performance for Big Data at scale with 3000 agentsTuning Apache Ambari performance for Big Data at scale with 3000 agents
Tuning Apache Ambari performance for Big Data at scale with 3000 agents
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Seamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive WarehouseSeamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive Warehouse
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
The state of Spark in the cloud
The state of Spark in the cloudThe state of Spark in the cloud
The state of Spark in the cloud
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
HyperLoop: Group-Based NIC-Offloading to Accelerate Replicated Transactions i...
HyperLoop: Group-Based NIC-Offloading to Accelerate Replicated Transactions i...HyperLoop: Group-Based NIC-Offloading to Accelerate Replicated Transactions i...
HyperLoop: Group-Based NIC-Offloading to Accelerate Replicated Transactions i...
 
SQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarSQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery Webinar
 
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform BenchmarkSunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
 

Similar to Active Data Guard @CERN on UKOUG 2012

OGG Architecture Performance
OGG Architecture PerformanceOGG Architecture Performance
OGG Architecture Performance
Enkitec
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture Performance
Enkitec
 
NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5
UniFabric
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
DataWorks Summit
 
Srimanta_Maji_Oracle_DBA
Srimanta_Maji_Oracle_DBASrimanta_Maji_Oracle_DBA
Srimanta_Maji_Oracle_DBA
SRIMANTA MAJI
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
Alfredo Abate
 
Oracle Cloud DBaaS
Oracle Cloud DBaaSOracle Cloud DBaaS
Oracle Cloud DBaaS
Arush Jain
 

Similar to Active Data Guard @CERN on UKOUG 2012 (20)

Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
 
OGG Architecture Performance
OGG Architecture PerformanceOGG Architecture Performance
OGG Architecture Performance
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture Performance
 
Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-features
 
NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5
 
Postgresql in Education
Postgresql in EducationPostgresql in Education
Postgresql in Education
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & Tricks
 
Mma 10g r2_936
Mma 10g r2_936Mma 10g r2_936
Mma 10g r2_936
 
Oracle DataGuard Online Training in USA | INDIA
Oracle DataGuard Online Training in USA | INDIAOracle DataGuard Online Training in USA | INDIA
Oracle DataGuard Online Training in USA | INDIA
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Srimanta_Maji_Oracle_DBA
Srimanta_Maji_Oracle_DBASrimanta_Maji_Oracle_DBA
Srimanta_Maji_Oracle_DBA
 
Oracle database 12c introduction- Satyendra Pasalapudi
Oracle database 12c introduction- Satyendra PasalapudiOracle database 12c introduction- Satyendra Pasalapudi
Oracle database 12c introduction- Satyendra Pasalapudi
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
PostgreSQL as a Big Data Platform
PostgreSQL as a Big Data Platform PostgreSQL as a Big Data Platform
PostgreSQL as a Big Data Platform
 
Optimize with Open Source
Optimize with Open SourceOptimize with Open Source
Optimize with Open Source
 
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
 
Oracle Cloud DBaaS
Oracle Cloud DBaaSOracle Cloud DBaaS
Oracle Cloud DBaaS
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Active Data Guard @CERN on UKOUG 2012

  • 1.
  • 2. Luca Canali – CERN Marcin Blaszczyk – CERN UKOUG Conference, Birmingham, 3rd-5th December 2012
  • 3. Outline • • • • • CERN and Oracle Architecture Use Cases for ADG@CERN Our experience with ADG and lessons learned Conclusions
  • 4. Outline • • • • • CERN and Oracle Architecture Use Cases for ADG@CERN Our experience with ADG and lessons learned Conclusions
  • 5. CERN • • • • • European Organization for Nuclear Research founded in 1954 20 Member States, 7 Observer States + UNESCO and UE 60 Non-member States collaborate with CERN 2400 staff members work at CERN as personnel, 10 000 more researchers from institutes world-wide 5 Nobel Laureates 5
  • 6. LHC and Experiments • Large Hadron Collider (LHC) – particle accelerator used to collide beams at very high energy • • • • • 27 km long circular tunnel Located ~100m underground Protons currently travel at 99.9999972% of the speed of light Collisions are analysed with usage of special detectors and software in the experiments dedicated to LHC New particle discovered! consistent with the Higgs Boson 6
  • 7. WLCG • The world‟s largest computing grid More than 20 Petabytes of data stored and analysed every year Over 68 000 physical CPUs Over 305 000 logical CPUs 157 computer centres in 36 countries More than 8000 physicists with real-time access to LHC data 7
  • 8. Oracle at CERN • Relational DBs play a key role in the LHC production chains • Accelerator logging and monitoring systems Online acquisition, offline: data (re)processing, data distribution, analysis Grid infrastructure and operation services • • • • Data management services • • Monitoring, dashboards, etc. File catalogues, file transfers, etc. Metadata and transaction processing for tape storage system 8
  • 9.
  • 10. CERN‟s Databases • ~100 Oracle databases, most of them RAC • • • Examples of critical production DBs: • • • Mostly NAS storage plus some SAN with ASM ~300 TB of data files for production DBs in total LHC logging database ~140 TB, expected growth up to ~70 TB / year 13 Production experiments‟ databases ~120 TB in total 15 Data Guard RAC clusters in Prod • Active Data Guards since upgrade to 11g Redo Transport 10
  • 11. Outline • • • • • CERN and Oracle Architecture Use Cases for ADG@CERN Our experience with ADG and lessons learned Conclusions
  • 12. Active Data Guard – the Basics • ADG enables read only access to the physical standby 12
  • 13. Active Data Guard @CERN • Replication • • • Load Balancing • • • Inside CERN from online (data acquisition) to offline (analysis) Replication to remote sites (being considered) Offloading queries Offloading backups Features available also with Data Guard • • • Disaster recovery Duplication of databases Others 13
  • 14. Architecture We Use Maximum performance Maximum performance Active Data Guard for users’ access Redo Transport Primary Database Active Data Guard for users’ access and for disaster recovery Primary Database Active Data Guard for disaster recovery 1. Low load ADG 2. Busy & critical ADG LOG_ARCHIVE_DEST_X=‘SERVICE=<tns_alias> OPTIONAL ASYNC NOAFFIRM VALID FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=<standby_db_unique_name>’ 14
  • 15. Deployment Model • Production RAC systems • • • Number of nodes: 2 - 5 NAS storage with SSD cache (Active) Data Guard RAC • • • Number of nodes: 2 - 4 Sized for average production load ASM on commodity HW, recycle „old production‟ 15
  • 16. Outline • • • • • CERN and Oracle Architecture Use Cases for ADG@CERN Our experience with ADG and lessons learned Conclusions
  • 17. Replication Use Cases • Number of replication use cases • • • Inside CERN Across the Grid So far handled by Oracle Streams • • Logical replication with messages called LCRs (Logical Change Record) Service started with 10gR1, now 11gR2 PROPAGATION Redo Transport 17
  • 18. ADG vs. Streams for Our Use Cases Streams Active Data Guard • Block level replication • • • • • • • • • More robust & Lower latency Full database replica Less maintenance effort  More use cases than just replication  Complex replication cases not covered  Read-only replica  SQL based replication • • • • More flexible Data selectivity Replica is accessible for read-write load  Unsupported data types  Throughput limitations  Can be complicated to recover  18
  • 19. Offloading Queries to ADG • Workload distribution • • • Transactional workload runs on primary Read-only workload can be moved to ADG Read-mostly workload • • DMLs can be redirected to remote database with a dblink Examples of workload on our ADGs: • Ad-hoc queries, analytics and long-running reports, parallel queries, unpredictable workload and test queries 19
  • 20. Offloading Backups to ADG • Significantly reduces load on primary • • Removes sequential I/O of full backup ADG has great improvement for VLDBs • allows usage of block change tracking for fast incremental backups 20
  • 21. Disaster Recovery • We have been using it since a few years • • • • Switchover/failover is our first line of defense Saved the day already for production services Disaster recovery site at ~10 km from our datacenter In the future remote site in Hungary * Active Data Guard option not required 21
  • 22. Duplicate for Upgrade 4 2 5 1 6 3 DATABASE downtime RDBMS upgrade Upgrade complete! Clusterware 10g + RDBMS 10g Redo Transport Redo Transport Clusterware 11g + RDBMS 10g 11g 22
  • 23. Duplicate for Upgrade - Comments • Risk is limited  • • • • • Downtime is limited  • • Fresh installation of the new clusterware Old system stays untouched Allows full upgrade test Allows stress testing of new system ~ 1h for RDBMS upgrade Additional hardware is required  • Only for the time of the upgrade * Active Data Guard option not required 23
  • 24. More Use Cases • Load testing • • Recover objects against logical corruption • • Human errors Leverage flashback logs • • Real application testing (RAT) Additional writes may have the negative impact on production database Data lifecycle activities 24
  • 25. Snapshot Standby • Facilitates opening standby in read-write mode • Number of steps is limited to two SQL> ALTER DATABASE CONVERT TO SNAPSHOT STANDBY; SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY; • • Restore point created internally by Oracle Primary database is always protected • Redo is being received when standby is opened 25
  • 26. Data Lifecycle Activities • Transport big data volumes from busy production DBs • • Transportable Tablespaces (TTS), Data Pump Challenge: TTS requires read only tablespaces Redo Transport Transportable Tablespaces Data Pump SQL> ALTER TABLESPACE data_2011q1 READ ONLY; data_2011q4 data_2011q2 data_2011q3 26
  • 29. Outline • • • • • CERN and Oracle Architecture Use Cases for ADG@CERN Our experience with ADG and lessons learned Conclusions
  • 30. Sharing Production Experience • ADG in production since Q1 2012 • • Oracle 11.2.0.3, OS: RHEL5 64 bit In the following: • • A few thoughts of how ADG is working in our environment Sharing experience from operations 30
  • 31. ADG for Replication • More robust than Streams • • • Fewer incidents, less configuration, less manpower Users like the low latency More complex replication setups • Still on Streams 31
  • 32. Offloading Backups to ADG • Problem: OLTP production slow during backup • Latency for random IO is critical to this system • degraded by backup sequential IO • It‟s a storage issue: • Midrange NAS storage 32
  • 33. Number of waits Details of Storage Issue on Primary Very slow reads appear Reads from SSD cache go to zero 33
  • 34. Custom Backup Implementation • We have modified our backup implementation and adapted it for ADG • • It was worth the effort for us. Many details to deal with. • One example: archive log backup and deletion policies • Google “Szymon Skorupinski CERN openworld 2012” for details 34
  • 35. Backups with RMAN • We find backups with RMAN still a very good idea • • Automatic restore and recovery system • • Data Guard is not a backup solution Periodic checks that backups can be recovered Block change tracking on ADG • • Incremental backups only read changed blocks Highly beneficial for backup performance 35
  • 36. Redo Apply Throughput • ADG redo apply by MRP • • Noticeable improvement in speed in 11g vs. 10g We see up to ~100 MB/s of redo processed • Much higher than typical redo generation in our DBs • Note on our configuration: • • Standby logfiles, real time apply + force logging Archiving to ADG using ASYNC NOAFFIRM 36
  • 37. Redo Apply Latency • Latency to ADG normally no more than 1 sec • Apply latency spikes: ~20 sec, depends on workload • • Latency builds up when ADG is creating big datafiles • • • Redo apply slaves wait for checkpoint completed (11g) One redo apply slave creates file, rest of slaves wait (11g) MRP (recovery) can stop with errors or crashes Possible workarounds: • Minimize redo switch by using large redo log files • Create datafiles with small initial size and then (auto)extend 37
  • 38. Latency and Applications • Currently alarms to on-call DBAs • • When latency primary-ADG greater than programmable What if we had to guarantee latency • Applications maybe would need to be modified to make them latency-aware Useful techniques to know • • ALTER SESSION SET STANDBY_MAX_DATA_DELAY=<n>; • ALTER SESSION SYNC WITH PRIMARY; – SYNC only! 38
  • 39. Administration • • • In most cases ADG doesn‟t require much DBA time after initial setup Not much different than handling production DBs Although we see bugs and issues specific to ADG • Two examples in following slides 39
  • 40. Failing Queries on ADG • Sporadic ORA-1555 (snapshot too old) on ADG • Normal behaviour if queries need read consistent images outside undo retention threshold • This case is anomalous: short queries • Under investigation, seems a bug • Example: ORA-01555 caused by SQL statement below (SQL ID: …, Query Duration=1 sec, SCN: …) 40
  • 41. Stuck Recovery and Lost Write • Issue: ORA-600 [3020] “Redo is inconsistent with data block” • • • • Analysis: • • • • Critical production with 2 ADGs. Both ADGs stop applying redo at the same time... on a Friday night! Primary keeps working Possible cause: lost write on primary Corruption in one index on primary - rebuilt online by DBA on-call Is the root cause an HW issue or an Oracle bug? Impact: • Redo apply on ADG is stuck while issue is being fixed by DBA 41
  • 42. Some Thoughts on Lost Write Issue • Complex recovery case • • How to be proactive? • Setting parameter DB_LOST_WRITE_PROTECT=TYPICAL • • • See also note: Resolving ORA-752 or ORA-600 [3020] During Standby Recovery [ID 1265884.1] It is a warning mechanism not a fix for lost write Impact: just a few percent increase in redo log writes Should I worry that my primary DB and ADG are not fully synchronized? • • Not easy to check Note ADG and primary are not exact binary copies 42
  • 43. Plans for the Future • Deploy ADG for replication to remote sites • Source at CERN, ADG at another grid site • ADGs under administration by remote sites‟ DBAs • Challenges: • • • Manage heterogeneous environment Redo transport over WAN Maintain a partial standby 43
  • 44. Features Under Investigation • Data Guard Broker • • • • • fast-start failover Partial standby configuration Cascading standby Synchronous redo transport ADG in 12c 44
  • 45. Outline • • • • • CERN and Oracle Architecture Use Cases for ADG@CERN Our experience with ADG and lessons learned Conclusions
  • 46. Conclusions • Active Data Guard provides added value to our database services • • In particular ADG has helped us • • • • Although requires some additional effort in monitoring Found a few bugs on the way We find ADG mature • • Strengthening our replication deployments Offloading production workload, including backups We find ADG has low maintenance overhead • • Almost 1 year of production experience Although we also look forward to additional improvements in next releases We are planning to extend our usage of ADG 46
  • 47. Acknowledgements • CERN Database Group • Experiments Community at CERN • Oracle contacts: Monica Marinucci, Greg Doherty, Larry Carpenter for discussions 47

Editor's Notes

  1. Speed of protons at 4TeV (4000 Gev) relative to speed of light using E(v)=m(0)*c^2*1/sqrt(1-(v/c)^2) and m(0)=0.938272029 GeV for protonsselect sqrt(1-1/power(4000/0.938272029,2)) from dual;
  2. MRP slave processes prx slaves wait for checkpoint complete at each redo log switch. This is show by ash:select * from gv$active_session_history where event=&apos;checkpoint completed&apos;;
  3. Note data blocks in ADG don’t receive (commit time) cleanout by default in 11g as redo for that is not logged (unless alter system set &quot;_log_committime_block_cleanout&quot;=true scope=memory sid=&apos;*&apos;;)So cleanout is done on ADG each time block is read. 11gr2 has concept of minimum active SCN to reduce work needed for block cleanout
  4. Example: commit cleanout is not logged on primary by default unless &quot;_log_committime_block_cleanout&quot;=true is set (default is false)