Under the Hood of the Smartest Availability Features in Oracle's Autonomous Database

This presentation discusses details of the smartest High Availability (HA) features in Oracle's Autonomous Databases. It also explains how those features are integrated in the various stages of the journey to the Autonomous Database. This presentation was first presented during Collaborate18 / #C18LV together with Maria Colgan (@SQLmaria).

Published in: Software

  1. Under the Hood of the Smartest Availability Features in Oracle's Autonomous Database. Maria Colgan, Master Principal Product Management, Oracle Database; Markus Michalewicz, Senior Director of Product Management, Database HA & Scalability. April 23, 2018. Copyright © 2018, Oracle and/or its affiliates. All rights reserved.
  2. Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  3. Program Agenda: (1) Overview; (2) Database Smart Features; (3) Smarter on Engineered Systems; (4) “Autonomous Database Smart”
  4. Program Agenda: (1) Overview; (2) Database Smart Features; (3) Smarter on Engineered Systems; (4) “Autonomous Database Smart”
  5. Oracle Autonomous Database Highlights (enabled by applied machine learning) • Self-Driving: automates database and infrastructure management, monitoring, and tuning • Self-Securing: protects from both external attacks and malicious internal users • Self-Repairing: protects from all downtime, including planned maintenance
  6. Journey to Autonomous Database: Oracle has invested thousands of engineer years automating key database functions • Automatic Columnar Cache • Autonomous Health Framework • Automatic Diagnostic Framework • Automatic Refresh of Database Clone • Automatic Capture of SQL Monitor • Automatic Data Optimization • Automatic Workload Replay • Automatic Storage Indexes • Automatic SQL Tuning • Automatic Segment Space Management • Automatic Statistics Gathering • Automatic Storage Management (ASM) • Automatic Workload Repository (AWR) • Automatic DB Diagnostic Monitor (ADDM) • Automatic Memory Management • Automatic Undo Management • Automatic Query Rewrite
  7. Some Automatic Features Will Become “Smart Features” • “Smart Features” are automatic features that are executed as needed, using real-time analysis of data at the moment of execution • Examples: - Automatic Data Optimization, SQL Plan Management, Hang Manager - Recovery Buddies, Smart Fencing - Autonomous Health Framework features such as Cluster Health Advisor
  8. Journey to Autonomous Database: Database Appliance and Exadata. Thousands of engineer years automating and optimizing database infrastructure.
  9. Autonomous Completes the Journey • Autonomous Database (Oracle Cloud) • Automated Data Center and Database Operations • Expanded Infrastructure Automation • Expanded Database Automation
  10. Three Components to Ensure Success: Automation, Engineered Systems, Oracle Cloud • Automated DC & DB Operations • Expanded Infrastructure Automation • Expanded Database Automation
  11. Program Agenda: (1) Overview; (2) Database Smart Features; (3) Smarter on Engineered Systems; (4) “Autonomous Database Smart”
  12. Automatic Data Optimization
  13. Automatic Data Optimization: Heat Map • An in-memory heat map tracks disk-based block and segment access - The heat map is periodically written to storage - Data is accessible by views or stored procedures • Users can attach policies (Policy 1, Policy 2, Policy 3 in the diagram) to automatically manage segments based on access - Tables, partitions, or sub-partitions can be moved in and out of the In-Memory Column Store, between storage tiers, and between compression levels - Online, no impact to data availability - It is NOT an archive and purge solution • Part of the Advanced Compression Option
  14. Automatic Data Optimization: Defining Policies • Policy actions include: - set (NO) INMEMORY or MEMCOMPRESS level - Advanced compression levels - Tier data to lower-cost storage • Policy criteria include: - after <time> of no access - after <time> of creation - after <time> of no modification - on <user-defined boolean function> • Actions run in the maintenance window • Policies can also be run manually with the dbms_ilm.execute_ilm procedure • Example clauses for ALTER TABLE sales ILM ADD POLICY …: inmemory after 1 days of creation; no inmemory after 30 days of creation; memcompress for capacity after 3 days of no modification; compress for archive high after 90 days of no access; tier to medium_storage_ts on MyCustomPolicyFunction;
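The time-based criteria on this slide can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: `SegmentHeat`, `Policy`, `policy_is_due`, and `due_actions` are hypothetical names invented for illustration, not Oracle APIs, and the real database evaluates policies from its own heat map data during the maintenance window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SegmentHeat:
    """Hypothetical heat-map record for one segment (not an Oracle structure)."""
    created: datetime
    last_access: datetime
    last_modification: datetime


@dataclass
class Policy:
    """Hypothetical policy: an action plus one time-based criterion."""
    action: str
    criterion: str  # "creation", "no access", or "no modification"
    days: int


def policy_is_due(policy: Policy, heat: SegmentHeat, now: datetime) -> bool:
    """Return True when the criterion's time span has fully elapsed."""
    reference = {
        "creation": heat.created,
        "no access": heat.last_access,
        "no modification": heat.last_modification,
    }[policy.criterion]
    return now - reference >= timedelta(days=policy.days)


def due_actions(policies, heat, now):
    """List the actions whose criteria are met, in policy order."""
    return [p.action for p in policies if policy_is_due(p, heat, now)]


now = datetime(2018, 4, 23)
heat = SegmentHeat(
    created=now - timedelta(days=40),
    last_access=now - timedelta(days=2),
    last_modification=now - timedelta(days=5),
)
policies = [
    Policy("no inmemory", "creation", 30),
    Policy("memcompress for capacity", "no modification", 3),
    Policy("compress for archive high", "no access", 90),
]
print(due_actions(policies, heat, now))
```

In this example the segment is 40 days old and was modified 5 days ago, so the first two policies are due, while the archive-compression policy is not because the segment was accessed 2 days ago.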
  15. SQL Plan Management (illustration: hash join of EMP and DEPT)
  16. Plan Stability Is Critical for Predictable Performance • Unpredictable changes can happen to an execution plan • Avoiding plan changes is the only method to avoid performance regression, but the traditional approaches fall short: - Locking statistics to prevent them from changing does not guarantee the plan won’t change - Freezing an execution plan with a Stored Outline relies on a feature that has been deprecated - Neither offers a mechanism for plans to evolve • Solution: use SQL Plan Management - The optimizer automatically manages execution plans: only known and verified plans are used - Plan changes are verified: only comparable or better plans are used going forward • Available in 18c Standard Edition
  17. How SPM Works (1) A SQL statement is submitted: SELECT count(empno) tot FROM emp e, dept d WHERE e.deptno=d.deptno AND d.dname='SALES'; (2) During hard parse, the optimizer determines an execution plan (here NL: EMP, DEPT) (3) Before execution, the plan is compared to the plans in the plan baseline to confirm it is acceptable. The NL plan is in the baseline, so it is an acceptable plan and is executed.
  18. How SPM Works (continued) (1) A SQL statement is submitted: SELECT count(empno) tot FROM emp e, dept d WHERE e.deptno=d.deptno AND d.dname='SALES'; (2) During hard parse, the optimizer determines an execution plan (here HJ: EMP, DEPT) (3) Before execution, the plan is compared to the plans in the plan baseline to confirm it is acceptable. The baseline holds only NL (EMP, DEPT), so the HJ plan is unacceptable. (4) If the plan does not match an accepted plan in the SQL plan baseline, it is added to the plan history but not executed
  19. How SPM Works (continued) (5) Only an accepted plan will be used. The plan history now holds HJ (EMP, DEPT) and NL (EMP, DEPT); the accepted NL plan from the plan baseline is executed for SELECT count(empno) tot FROM emp e, dept d WHERE e.deptno=d.deptno AND d.dname='SALES';
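The plan-selection flow from the "How SPM Works" slides can be sketched in Python. This is a simplified illustration, not Oracle's implementation: `SqlPlanHistory` and `choose_plan` are hypothetical names, plans are represented as plain strings, and the sketch simply returns the first accepted plan, whereas the real optimizer chooses among accepted plans by cost.

```python
from dataclasses import dataclass, field


@dataclass
class SqlPlanHistory:
    """Toy plan history for one SQL statement: plan name -> accepted flag."""
    plans: dict = field(default_factory=dict)

    def baseline(self):
        """The plan baseline is the accepted subset of the plan history."""
        return [name for name, accepted in self.plans.items() if accepted]


def choose_plan(history: SqlPlanHistory, optimizer_plan: str) -> str:
    """Return the plan to execute for a hard-parsed statement."""
    baseline = history.baseline()
    if not baseline:
        # Assumption for this sketch: with no baseline, SPM does not intervene.
        return optimizer_plan
    if optimizer_plan in baseline:
        # Steps 2-3: the optimizer's plan matches an accepted plan.
        return optimizer_plan
    # Step 4: record the new plan as unaccepted, but do not execute it.
    history.plans.setdefault(optimizer_plan, False)
    # Step 5: only an accepted plan is used (simplification: the first one).
    return baseline[0]


history = SqlPlanHistory({"NL(EMP,DEPT)": True})
print(choose_plan(history, "NL(EMP,DEPT)"))  # accepted plan is executed
print(choose_plan(history, "HJ(EMP,DEPT)"))  # falls back to the accepted NL plan
print(history.plans)                         # HJ is recorded as not accepted
```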
  20. Automatic SQL Plan Management • A new evolve auto task runs in the maintenance window - Ranks all non-accepted plans and runs the evolve process for them; newly found plans are ranked the highest - If a new plan performs better than the existing plan, it is automatically accepted - If a new plan performs worse than the existing plan, it remains unaccepted - Poorly performing plans are not retried for 30 days, and then only if the statement is active • The new task is SYS_AUTO_SPM_EVOLVE_TASK • Information on the task is found in DBA_ADVISOR_TASKS • Use DBMS_SPM.REPORT_AUTO_EVOLVE_TASK to view the results of the auto job
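The ranking and acceptance rules listed on this slide can be sketched as follows. This is a hedged toy model: `UnacceptedPlan`, `eligible`, and `evolve` are hypothetical names, a single `elapsed_ms` number stands in for whatever measurements the real evolve task compares, and only the 30-day retry rule is taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional

RETRY_AFTER_DAYS = 30  # from the slide: poor plans are not retried for 30 days


@dataclass
class UnacceptedPlan:
    """Hypothetical record of a non-accepted plan awaiting evolution."""
    name: str
    elapsed_ms: float                          # assumed test-execution time
    days_since_last_try: Optional[int] = None  # None means newly found
    statement_active: bool = True


def eligible(plan: UnacceptedPlan) -> bool:
    """New plans are always tried; failed ones only after the retry window."""
    if plan.days_since_last_try is None:
        return True
    return plan.days_since_last_try >= RETRY_AFTER_DAYS and plan.statement_active


def evolve(candidates, baseline_elapsed_ms: float):
    """Return names of plans accepted in this run, newly found plans first."""
    ranked = sorted(candidates, key=lambda p: p.days_since_last_try is not None)
    return [
        p.name
        for p in ranked
        if eligible(p) and p.elapsed_ms < baseline_elapsed_ms
    ]


candidates = [
    UnacceptedPlan("old_slow", 80.0, days_since_last_try=10),
    UnacceptedPlan("retry_fast", 40.0, days_since_last_try=45),
    UnacceptedPlan("new_fast", 50.0),
    UnacceptedPlan("new_slow", 200.0),
]
print(evolve(candidates, baseline_elapsed_ms=100.0))
```

Here `new_fast` and `retry_fast` are accepted, `new_slow` stays unaccepted because it is slower than the existing plan, and `old_slow` is skipped because it was tried only 10 days ago.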
  21. Automatic Evolution Task Report: EXECUTE :evol_out := DBMS_SPM.REPORT_AUTO_EVOLVE_TASK(type => 'TEXT'); SELECT :evol_out FROM DUAL;
  22. Database Hang Manager
  23. Introduction to Hang Manager: How it works (diagram labels: Session, DIAG0, EVALUATE, DETECT, ANALYZE, Hung?, VERIFY, Victim, QoS Policy) • Always on, as it is enabled by default • Reliably detects database hangs - Including cross-layer hangs between ASM & DB • Automatically resolves hangs • Supports QoS Performance Classes, Ranks and Policies to maintain SLAs • Logs all detected hangs & their resolutions • New in 18c: resolves deadlocks
  24. Hang Manager Optimizations with Oracle RAC 12c (Rel. 2): Tuning under the hood • Hang Manager auto-tunes itself by periodically collecting instance- and cluster-wide hang statistics • Metrics such as cluster health and instance health are tracked as a moving average • This moving average is considered during resolution • Holders waiting on SQL*Net break/reset are fast-tracked
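To make the "moving average" idea concrete, here is a minimal Python sketch of a windowed average over periodic health samples. The slide does not say which kind of moving average Hang Manager uses, so the simple fixed-window form, the window size, and the 0-to-1 health scale are all assumptions made for illustration.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the most recent `window` samples."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        # A bounded deque drops the oldest sample automatically.
        self._samples = deque(maxlen=window)

    def add(self, value: float) -> float:
        """Record a sample and return the current average."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)


# Hypothetical instance-health readings (1.0 = healthy), sampled periodically.
health = MovingAverage(window=3)
readings = [1.0, 0.8, 0.6, 0.2]
averages = [health.add(r) for r in readings]
print(averages)
```

Averaging smooths out a single bad sample, so a decision based on the average reacts to a sustained decline rather than to one outlier.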
  25. Hang Manager: Full Resolution Dump Trace File and DB Alert Log Audit Reports. Sequence: hang detected by Hang Manager; session victim identified and termination requested; blocker session terminated. Elapsed time: ~5.3 secs. Alert log: 2015-10-13T16:47:59.435039+17:00 Errors in file /oracle/log/diag/rdbms/hm6/hm6/trace/hm6_dia0_12433.trc (incident=7353): ORA-32701: Possible hangs up to hang ID=1 detected Incident details in: …/diag/rdbms/hm6/hm6/incident/incdir_7353/hm6_dia0_12433_i7353.trc 2015-10-13T16:47:59.506775+17:00 DIA0 requesting termination of session sid:40 with serial # 43179 (ospid:13031) on instance 2 due to a GLOBAL, HIGH confidence hang with ID=1. Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions. DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=1. In the alert log on the instance local to the session (instance 2 in this case), we see the following: 2015-10-13T16:47:59.538673+17:00 Errors in file …/diag/rdbms/hm6/hm62/trace/hm62_dia0_12656.trc (incident=5753): ORA-32701: Possible hangs up to hang ID=1 detected Incident details in: …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc 2015-10-13T16:48:04.222661+17:00 DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) of hang with ID = 1 requested by master DIA0 process on instance 1 Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions. by terminating session sid:40 with serial # 43179 (ospid:13031)
  26. DBMS_HANG_MANAGER.Sensitivity: A new SQL interface to set Hang Manager sensitivity • Hang sensitivity levels: - NORMAL (default): Hang Manager uses its default internal operating parameters to try to meet typical requirements for any environment - HIGH: Hang Manager is more alert to sessions waiting in a chain than at the NORMAL level • Early warning is exposed via a V$ view • Sensitivity can be set higher if the default level is too conservative • Hang Manager considers QoS policies and data during the validation process
  27. Recovery Buddies
  28. Near Zero Reconfiguration Time with Recovery Buddies (a.k.a. Buddy Instances) • Recovery Buddies: - Track block changes on the buddy instance - Quickly identify blocks requiring recovery during reconfiguration - Allow rapid processing of transactions after failures
  29. Near Zero Reconfiguration Time with Recovery Buddies: How it works under the hood • Buddy instance mapping is simple (random) - e.g. I1 → I2, I2 → I3, I3 → I4, I4 → I1 • Recovery buddies are assigned during startup • RMS0 on each recovery buddy instance maintains an in-memory area for redo log changes • This in-memory area is used during recovery - Eliminates the need to physically read the redo • Recovery Buddies is a smart feature that is enabled by default and executed at “best effort” (diagram: cluster MyCluster with instances I1 to I4, each hosting a recovery buddy for another instance)
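The buddy mapping and the in-memory change tracking can be sketched in Python. Everything here is illustrative: the ring assignment mirrors the slide's example mapping (the slide itself calls the mapping "simple (random)"), reading "I1 → I2" as "I2 is the recovery buddy of I1" is an assumption, and `assign_recovery_buddies` and `BuddyTracker` are hypothetical names rather than Oracle internals.

```python
def assign_recovery_buddies(instances):
    """Map each instance to the next one in the list, wrapping around."""
    if len(instances) < 2:
        raise ValueError("a buddy needs at least two instances")
    return {
        inst: instances[(i + 1) % len(instances)]
        for i, inst in enumerate(instances)
    }


class BuddyTracker:
    """Toy model: each buddy keeps an in-memory record of its peer's changes."""

    def __init__(self, instances):
        self.buddy_of = assign_recovery_buddies(instances)
        # In-memory areas, keyed by the instance that hosts them.
        self.areas = {inst: set() for inst in instances}

    def record_change(self, instance, block_id):
        """A change made on `instance` is tracked on its buddy."""
        self.areas[self.buddy_of[instance]].add((instance, block_id))

    def blocks_to_recover(self, failed_instance):
        """Identify affected blocks from the buddy's memory, without reading redo."""
        area = self.areas[self.buddy_of[failed_instance]]
        return sorted(block for inst, block in area if inst == failed_instance)


tracker = BuddyTracker(["I1", "I2", "I3", "I4"])
tracker.record_change("I1", 101)
tracker.record_change("I1", 7)
tracker.record_change("I4", 55)
print(tracker.buddy_of)
print(tracker.blocks_to_recover("I1"))
```

If I1 fails, its buddy I2 can list blocks 7 and 101 directly from memory, which is the step that replaces the physical redo read in this simplified picture.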
  30. How Recovery Buddies Help Reduce Recovery Time • Without Recovery Buddies: Detect, Evict, Elect Recovery, Read Redo, Apply Recovery • With Recovery Buddies: the same phases (Detect, Evict, Elect Recovery, Read Redo, Apply Recovery) complete up to 4x faster
  31. Smart Fencing
  32. Node Eviction Basics • Pre-12.2, node eviction follows a rather “ignorant” pattern - Example in a 2-node cluster: the node with the lowest node number survives • Customers must not base their application logic on which node survives the split brain - As this may(!) change in future releases • See http://www.slideshare.net/MarkusMichalewicz/oracle-clusterware-node-management-and-voting-disks
  33. Node Weighting in Oracle RAC 12c Release 2. Idea: everything equal, let the majority of work survive • Node Weighting is a new feature that considers the workload hosted in the cluster during fencing - Hence it is called “Smart Fencing” • The idea is to let the majority of work survive if everything else is equal - Example: in a 2-node cluster, the node hosting the majority of services (at fencing time) is meant to survive
  34. Let’s Define “Equal” • A three-node cluster will benefit from “Node Weighting” if three equally sized sub-clusters are built as a result of the failure • Secondary failure consideration can influence which node survives • Secondary failure consideration will be enhanced successively • A fallback scheme is applied if considerations do not lead to an actionable outcome (diagram labels: public network card failure; “Conflict”)
  35. CSS_CRITICAL: Fencing with Manual Override • CSS_CRITICAL can be set on various levels / components to mark them as “critical” so that the cluster will try to preserve them in case of a failure - crsctl set server css_critical {YES|NO} + server restart - srvctl modify database -help | grep critical … -css_critical {YES | NO} Define whether the database or service is CSS critical • CSS_CRITICAL will be honored if no other technical reason prohibits survival of the node that has at least one critical component at the time of failure • A fallback scheme is applied if CSS_CRITICAL settings do not lead to an actionable outcome (diagram labels: “Conflict”; node eviction despite workload (WL); WL will fail over)
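The Smart Fencing slides describe several tie-breakers for deciding which side of a split survives. The Python sketch below shows one way such a decision could be ordered. The precedence used here (sub-cluster size, then CSS_CRITICAL, then hosted services, then lowest node number as the fallback) is an assumption chosen for illustration; the slides do not spell out the exact order, and `Node` and `pick_survivor` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a cluster node at fencing time."""
    number: int
    services: int = 0           # workload weight: number of hosted services
    css_critical: bool = False  # manual override flag


def pick_survivor(subclusters):
    """Return the sub-cluster (list of nodes) that should survive a split."""

    def rank(sub):
        return (
            len(sub),                          # larger sub-cluster wins
            any(n.css_critical for n in sub),  # assumed: manual override next
            sum(n.services for n in sub),      # then the majority of work
            -min(n.number for n in sub),       # fallback: lowest node number
        )

    return max(subclusters, key=rank)


# 2-node split brain: node 2 hosts more services, so it survives.
by_weight = pick_survivor([[Node(1, services=1)], [Node(2, services=3)]])
# Marking node 1 as critical overrides the workload weighting in this sketch.
by_override = pick_survivor(
    [[Node(1, services=1, css_critical=True)], [Node(2, services=3)]]
)
# Everything equal: fall back to the lowest node number.
by_fallback = pick_survivor([[Node(1)], [Node(2)]])
print(by_weight[0].number, by_override[0].number, by_fallback[0].number)
```

Expressing the rules as a tuple keeps the tie-breaking order explicit, so a different precedence can be tried by reordering the tuple's elements.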
  36. Program Agenda: (1) Overview; (2) Database Smart Features; (3) Smarter on Engineered Systems; (4) “Autonomous Database Smart”
  37. Engineered Systems: Designed for Success • Engineered Systems are well-known configurations allowing for specialization • Engineered Systems provide hardware-assisted resilience • Engineered Systems enable optimized software utilization
  38. Well Known Configurations • Less than 5% of the Oracle Database code is OS-dependent - This makes the Oracle Database very portable, but limits specialization • Engineered Systems use an Oracle-owned OS and hardware that is well known (down to the firmware) - Highly specialized configurations enable optimized software and hardware utilization • Protocol support by hardware (HW): - UDP: on Infiniband (IB), UDP over IP over IB is generically supported on all HW; on Converged Ethernet (CE), UDP over CE will be supported generically as part of CE support - RDS: on IB, RDS over IB requires Oracle Linux + certain UEK versions and an Oracle-branded HCA; on CE, N/A - RoCE: on IB, N/A; on CE, RDMA over CE (RoCE) will be supported on Engineered Systems only
  39. Hardware-Assisted Resilience • Exadata uses the IB Subnet Manager to inform higher-level software layers about component failures - Callouts eliminate waiting for timeouts - Examples include, but are not limited to: Fast Node Death Detection and Fast Cell Death Detection • Exadata Database Machines use a special I/O fencing mechanism based on the “diskmon” process, which - Monitors and handles storage cell failures and I/O fencing - Broadcasts intra-database IORM (I/O Resource Manager) plans from databases to storage cells
  40. Optimized Software Utilization: Exafusion and Cache Fusion Accelerator reduce context switches • Exafusion - Exafusion provides lower latency & higher throughput via direct-to-wire block transfers between Oracle RAC instances - Data is transferred directly from user space to the Infiniband network, leading to reduced CPU utilization and better scale-out performance • Cache Fusion Accelerator - The Cache Fusion Accelerator (CFA) is an OS kernel module (Linux & Solaris only) which can respond directly to certain lock requests via RDSv3 - CFA saves user/kernel context switches, frees up CPU cycles in LMS, and “speeds up” messages - CFA will be activated on Engineered Systems over time, including the Oracle Database Appliance
  41. Optimized Software Utilization: RDMA decreases messaging and CPU time (diagram: Instance 1, Instance 2, Instance 3 reading UNDO via RDMA) • Undo Block RDMA-read - In some workloads, more than half of the remote reads are for undo blocks to satisfy read consistency - Undo Block RDMA-read uses RDMA to directly and rapidly access undo blocks in remote instances • Commit Cache - The Commit Cache maintains an in-memory table on each instance which records the commit time of transactions - The remote LMS directly reads the commit cache and sends back commit times for requested transactions - This replaces having to send an entire 8K transaction table block
  42. Optimized Software Utilization: In-Memory Duplication • Similar to storage mirroring • Duplicates in-memory columns on another node • Enabled per table/partition - E.g. only recent data • Application transparent • Downtime is eliminated by using the duplicate after a failure • Improved scalability by reading from both sides of the mirror
  43. Program Agenda: (1) Overview; (2) Database Smart Features; (3) Smarter on Engineered Systems; (4) “Autonomous Database Smart”
  44. Autonomous Health Framework: Powered by Machine Learning • Oracle Autonomous Health Framework (AHF) was released with Oracle Database 12c Release • Oracle AHF 18c extends Machine Learning to more utilities in the Framework, such as - Hang Manager - Trace File Analyzer
  45. Hang Manager: Smarter with Applied Machine Learning • Actual internal and external customer data drives model development • Purpose-built diagnostic technology is used for knowledge extraction • An expert development team scrubs the data • The Hang Heuristic Engine is created and deployed at the customer • HM uses a run-time engine to perform real-time DB hang detection and resolution (diagram labels: HM Dev Team, ASH, Scrub Data, Knowledge Extraction, Expert Supervision, Heuristic Engine, HM Runtime Engine, Feedback)
  46. Trace File Analyzer: Smart Collection • Always on, enabled by default • Comprehensive first-failure diagnostic collection • Efficiently collects, packages, and transfers diagnostic data to Oracle Support • Transfers data to centralized storage for detailed analysis with TFA Service • Supports Database 10.2 and above • Included since 11.2.0.4 and 12.1.0.2 and updated in Patchsets & PSUs
  47. Trace File Analyzer: Smarter with Applied Machine Learning • Machine Learning-based knowledge extraction from logs, SRs, and bugs • Expert training refines the training data set • Knowledge is embedded in the run-time model • The model is shipped in TFA Collector to work with the live logs on the cluster • Log anomaly detection is performed with TFA Receiver • No model training is required by the user • The model is updated regularly (diagram labels: TFA Dev Team, Bugs, SR, Scrub Data, ML Knowledge Extraction, Expert Supervision, Model Generation, TFA Runtime Model, TFA Web, TFA Receiver, TFA Collector)
  48. Oracle’s Autonomous Database is Smarter … Because of “Expert Supervision” (diagram labels: Data, Scrub Data, ML Knowledge Extraction, Model Generation, Expert Supervision; TFA Dev Team, CHA Dev Team, HM Dev Team, MAA Dev Team) http://oracle.com/goto/maa