Redefining tables online without surprises
Nelson Calero
The Oracle database includes several features that allow data to be moved online, i.e., without preventing users from accessing it while it is being moved (DML operations are not blocked).
One of those features changes a table definition using the DBMS_REDEFINITION package.
Although moving a table has been an online operation since version 12.2, redefinition is still needed for some changes, and it is also needed in older versions.
This session presents best practices based on experience using it with big tablespaces, with examples covering all the steps needed to use DBMS_REDEFINITION under different scenarios, including the problems you may encounter, how to resolve them, and how the process differs between versions 11.2 and 12.
This document provides an overview of Hive and its performance capabilities. It discusses Hive's SQL interface for querying large datasets stored in Hadoop, its architecture which compiles SQL queries into MapReduce jobs, and its support for SQL semantics and datatypes. The document also covers techniques for optimizing Hive performance, including data abstractions like partitions, buckets and skews. It describes different join strategies in Hive like shuffle joins, broadcast joins and sort-merge bucket joins and how they are implemented in MapReduce. The overall presentation aims to explain how Hive provides scalable SQL processing for big data.
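To make the difference between a broadcast join and a shuffle join concrete, here is a minimal single-process sketch in Python. It is not Hive code: the function names, the row dictionaries, and the `num_reducers` parameter are assumptions made for illustration, and the per-partition step reuses a hash join for brevity where a real MapReduce reducer would typically work on sorted input.

```python
from collections import defaultdict


def broadcast_join(large_rows, small_rows, key):
    """Map-side join: load the small table into memory, then stream the large one."""
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)
    for row in large_rows:
        # .get() avoids creating empty entries for keys with no match.
        for match in lookup.get(row[key], []):
            yield {**match, **row}


def shuffle_join(left_rows, right_rows, key, num_reducers=2):
    """Reduce-side join: route rows with the same key to the same partition, then join."""
    partitions = [([], []) for _ in range(num_reducers)]
    for row in left_rows:
        partitions[hash(row[key]) % num_reducers][0].append(row)
    for row in right_rows:
        partitions[hash(row[key]) % num_reducers][1].append(row)
    for left_part, right_part in partitions:
        # Simplification: each "reducer" joins its partition with a hash join.
        yield from broadcast_join(left_part, right_part, key)
```

Both functions return the same matched rows. The broadcast variant avoids moving the large table at all, which is why it is attractive when one side fits in memory.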
This document provides an overview of HBase and why NoSQL databases like HBase were developed. It discusses how relational databases do not scale well horizontally with large amounts of data. HBase was created to address these scaling issues and was inspired by Google's Bigtable database. The document explains the HBase data model with rows, columns, and versions. It describes how data is stored physically in HFiles and served from memory and disks. Basic operations like put, get, and scan are also covered.
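The row, column, and version model with put, get, and scan can be pictured with a small in-memory sketch. This is a toy model in Python, not the HBase client API: the class name `TinyTable`, the logical clock, and the `max_versions` parameter are assumptions for illustration, and the exclusive stop row in `scan` is a choice made for this sketch.

```python
import bisect
import itertools


class TinyTable:
    """Toy model: row key -> column -> list of (timestamp, value), newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._rows = {}
        self._sorted_keys = []            # kept sorted so scans can walk a key range
        self._clock = itertools.count(1)  # logical timestamps instead of wall-clock time

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.insert(0, (next(self._clock), value))
        del versions[self.max_versions:]  # drop versions beyond the retention limit

    def get(self, row, column, versions=1):
        cells = self._rows.get(row, {}).get(column, [])
        return [value for _, value in cells[:versions]]

    def scan(self, start_row, stop_row):
        """Yield (row key, latest values) for start_row <= key < stop_row."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for key in self._sorted_keys[lo:hi]:
            yield key, {col: cells[0][1] for col, cells in self._rows[key].items()}
```

Keeping row keys sorted is the design point worth noticing: it is what turns a range scan into a walk over adjacent keys rather than a full pass over the table.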
Apache HBase Improvements and Practices at Xiaomi
HBaseCon
Duo Zhang and Liangliang He (Xiaomi)
In this session, we’ll discuss the various practices around HBase in use at Xiaomi, including those relating to HA, tiered compaction, multi-tenancy, and failover across data centers.
Looking under the covers: Using SNMP to peek inside Erlang
David Dossot
This document discusses using SNMP to monitor Erlang applications and nodes. It provides an overview of SNMP, describes how to configure an Erlang SNMP agent with custom and standard objects, and demonstrates how to create dynamic tables and monitor various system metrics. Examples are given to monitor node names, gauge values, filesystem usage, and standard Erlang VM and OTP metrics. Source code is provided for setting up an Erlang SNMP agent and monitoring custom objects.
Oracle rac cachefusion - High Availability Day 2015
aioughydchapter
RAC Cache Fusion allows Oracle Real Application Clusters instances to share cached data in memory to avoid disk I/O and improve performance. Key aspects of Cache Fusion include global cache services coordinating cached data across instances, maintaining data consistency through modes and roles for cached blocks, and keeping past images of dirty blocks for recovery purposes. Cache blocks can be accessed locally or globally depending on their assigned role and mode.
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Databricks
Timely data in a data warehouse is a challenge many of us face, often with no straightforward solution.
Using a combination of batch and streaming data pipelines, you can leverage the Delta Lake format to provide an enterprise data warehouse at a near real-time frequency. Delta Lake eases the ETL workload by enabling ACID transactions in a warehousing environment. Coupling this with structured streaming, you can achieve a low-latency data warehouse. In this talk, we’ll discuss how to use Delta Lake to improve the latency of ingestion and storage of your data warehouse tables. We’ll also cover how you can use Spark streaming to build the aggregations and tables that drive your data warehouse.
The document discusses two options for achieving high availability for Oracle Database Standard Edition 2 (SE2):
1) Standard Edition High Availability (SEHA) provides an out-of-the-box failover cluster configuration using Oracle Grid Infrastructure that supports automatic failover between two nodes.
2) Using refreshable pluggable databases (PDBs) allows cloning a PDB from a primary database to a secondary database for read-only reporting or to refresh the secondary PDB periodically to propagate changes.
In real life, almost any project deals with tree structures. Different kinds of taxonomies, site structures, etc. require modeling of hierarchical relations.
Typical approaches used:
● Model Tree Structures with Child References
● Model Tree Structures with Parent References
● Model Tree Structures with an Array of Ancestors
● Model Tree Structures with Materialized Paths
● Model Tree Structures with Nested Sets
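Several of the approaches listed above can be derived from one another. The Python sketch below starts from parent references and computes the array of ancestors, a materialized path, and nested-set bounds. The sample category names, the comma separator, and the function names are assumptions chosen for illustration; they are not taken from the presentation.

```python
# Parent references: each node stores the name of its parent (None for the root).
parents = {
    "Electronics": None,
    "Cameras": "Electronics",
    "Phones": "Electronics",
    "DSLR": "Cameras",
}


def ancestors(node, parents):
    """Array of ancestors, root first."""
    chain = []
    while parents[node] is not None:
        node = parents[node]
        chain.append(node)
    return chain[::-1]


def materialized_path(node, parents, sep=","):
    """Full path from the root to the node as one string."""
    return sep.join(ancestors(node, parents) + [node])


def nested_sets(parents):
    """Assign (left, right) bounds via a depth-first walk.

    A node's descendants are exactly the nodes whose bounds lie inside its own.
    """
    children = {}
    for node, parent in parents.items():
        children.setdefault(parent, []).append(node)

    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in sorted(children.get(node, [])):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)

    for root in sorted(children.get(None, [])):
        visit(root)
    return bounds
```

The trade-off the sketch hints at: parent references are cheap to update but need a walk to find ancestors, while materialized paths and nested sets answer subtree queries directly at the cost of rewriting stored values when the tree changes.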
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
ScyllaDB
Discover the new features and capabilities of Scylla Open Source 5.0 directly from the engineers who developed it. This second block of lightning talks will cover the following topics:
- New IO Scheduler and Disk Parallelism
- Per-Service-Level Timeouts
- Better Workload Estimation for Backpressure and Out-of-Memory Conditions
- Large Partition Handling Improvements
- Optimizing Reverse Queries
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Smart monitoring how does oracle rac manage resource, state ukoug19
Anil Nair
An important requirement for HA and scalability is to detect problems and resolve them quickly, before user sessions are affected. Oracle RAC and its Family of Solutions work together to quickly detect conditions such as "Un-responsive Instances" and network issues, and to resolve them by redirecting the work either to other instances or to redundant network paths.
Scylla Summit 2022: Making Schema Changes Safe with Raft
ScyllaDB
ScyllaDB adopted Raft as a consensus protocol in order to dramatically improve our operational aspects as well as provide strong consistency to the end-user. This talk will explain how Raft behaves in Scylla Open Source 5.0 and introduce the first end-user visible major improvement: schema changes. Learn how cluster configuration resides in Raft, providing consistent cluster assembly and configuration management. This makes bootstrapping safer and provides reliable disaster recovery when you lose the majority of the cluster.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
At Instagram, our mission is to capture and share the world's moments. Our app is used by over 400M people monthly; this creates a lot of challenging data needs. We use Cassandra heavily as a general key-value store. In this presentation, I will talk about how we use Cassandra to serve our critical use cases; the improvements and patches we made to make sure Cassandra can meet our low-latency, high-scalability requirements; and some pain points we have.
About the Speaker
Dikang Gu Software Engineer, Facebook
I'm a software engineer on the Instagram core infra team, working on scaling Instagram infrastructure, especially on building a generic key-value store based on Cassandra. Prior to this, I worked on the development of HDFS at Facebook. I received a master's degree in Computer Science from Shanghai Jiao Tong University in China.
Anoop Sam John and Ramkrishna Vasudevan (Intel)
HBase provides an LRU-based on-heap cache, but its size (and so the total data size that can be cached) is limited by Java’s max heap space. This talk highlights our work under HBASE-11425 to allow the HBase read path to work directly from the off-heap area.
HBase Accelerated introduces an in-memory flush and compaction pipeline for HBase to improve performance of real-time workloads. By keeping data in memory longer and avoiding frequent disk flushes and compactions, it reduces I/O and improves read and scan latencies. Evaluation on workloads with high update rates and small working sets showed the new approach significantly outperformed the default HBase implementation by serving most data from memory. Work is ongoing to further optimize the in-memory representation and memory usage.
The GNU Toolchain is the de facto standard of the IT industry and has been improved by comprehensive open source contributions. This session is expected to cover the mechanisms of the compiler driver, system interaction (taking GNU/Linux as an example), the linker, the C runtime library, and the related dynamic linker. Rather than analyzing the system design, the session is use-case driven and illustrated progressively.
Getting the Scylla Shard-Aware Drivers Faster
ScyllaDB
Alexys will explain how Scylla shard-aware drivers are implemented and why Scylla benefits from them. Taking the Python shard-aware driver as an example, Alexys will demonstrate how recent shard-aware drivers can leverage the new shard allocation algorithm that Scylla 4.3 brings to the table, and how to make use of it from a developer or administrator point of view. Alexys will showcase benefits from real production graphs observed at Numberly.
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
DataWorks Summit
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.
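The idea of lightweight min/max indexes combined with pushdown filters can be sketched in a few lines of Python. This is an illustration of the technique only, not the ORC reader or its file layout: the function names and the tuple-based index are assumptions, and the 10,000-row group size is taken from the abstract above.

```python
def build_row_group_index(values, rows_per_group=10_000):
    """Record (start offset, min, max) for each group of rows in one column."""
    index = []
    for start in range(0, len(values), rows_per_group):
        group = values[start:start + rows_per_group]
        index.append((start, min(group), max(group)))
    return index


def scan_equal(values, index, target, rows_per_group=10_000):
    """Find rows equal to target, skipping groups whose [min, max] excludes it."""
    hits = []
    groups_read = 0
    for start, low, high in index:
        if target < low or target > high:
            continue  # the predicate cannot match anything in this group
        groups_read += 1
        for offset, value in enumerate(values[start:start + rows_per_group]):
            if value == target:
                hits.append(start + offset)
    return hits, groups_read
```

With sorted or clustered data, most groups are rejected from the index alone, so the scan touches only a small fraction of the column. On randomly ordered data the min/max ranges overlap heavily and far fewer groups can be skipped.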
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.
LiquiBase is an open source tool for tracking, managing and applying database changes, where database changes are stored in an XML file called a changelog that is executed to handle different revisions. It aims to provide consistent database changes across environments by managing databases at different states and keeping a history of all changes made through automatic rollback support and ability to effectively manage variable changes. Problems with manual database changes include inconsistent application of changes and databases becoming out of sync between environments.
The document describes Apache Hive hooks, which allow intercepting function calls or events during query execution in Hive. It provides details on the different hook points in Hive, including pre-execution, post-execution, and failure hooks. It also explains how to configure hooks by setting hook properties and the jar paths for hook implementations. Finally, it outlines the interfaces and contexts provided to hooks at each stage of query processing in Hive.
Patterns and Operational Insights from the First Users of Delta Lake
Databricks
Delta Lake was used to ingest and process large volumes of data, from tens of TB per day to hundreds of TB per day. Key patterns discussed include extracting and loading data from S3 into streaming tables, transforming the data through parsing, merging logic for debugging, using stateful transformations, building aggregation funnels, handling deduplication through SCD updates, optimizing storage and metadata for high scale tables, and addressing patterns like composability, schema ordering, partitioning, and handling conflicting transactions.
The document discusses Oracle Real Application Clusters (RAC) architecture and internals. A typical RAC configuration includes multiple nodes connected to a public network, interconnect, and shared storage. Oracle Grid Infrastructure manages the clusterware and Automatic Storage Management. It provides high availability of databases and other applications by enabling them to run on multiple nodes and utilize the shared storage. The document covers various RAC components like VIPs, listeners, SCAN, client connectivity, node membership, and the interconnect.
This talk will break down merge in Delta Lake, showing what actually happens under the hood, and then explain how you can optimize a merge. Some code snippets and sample configs will also be shared.
PHP Coding Standard and 50+ Programming Skills
Ho Kim
1. How and why to write good code
2. Coding standard based on ZendFramework and real-world practice.
3. PHP programming skills from daily coding.
4. Some security tips
5. Some optimization tips
Lixun Peng presents Double Sync Replication as a solution to problems with asynchronous and semi-synchronous replication. Double Sync Replication uses two replication channels - an asynchronous channel to continuously replicate binary logs from the master, and a semi-synchronous channel to replicate the latest binary logs and position. This allows the slave to always know the latest position on the master and compare logs from both channels to determine consistency. The asynchronous channel is used to fully apply logs when the network is down to catch the slave back up.
This document discusses several features that Alibaba is developing or contributing to MariaDB, including:
1. Time Machine/Flashback, which allows rolling databases back to snapshots by reversing DML operations in binary logs.
2. Double Sync Replication, which combines asynchronous and semi-synchronous replication to ensure the slave always knows the master's status.
3. Multi-source replication, which allows a single slave to replicate from multiple masters to support data sharding and backups across instances.
4. Thread Memory Monitor, which tracks memory usage by thread to identify which queries are using the most memory when the mysqld process exceeds limits.
Time Machine allows rolling back databases, tables, or instances to a previous snapshot by replaying binary logs in reverse. It works at the server level for all storage engines and formats binary logs using full images. Currently, it is a feature inside the mysqlbinlog tool. The tool reverses DML operations by changing event types and swapping SET and WHERE clauses to recover data modified in error. Future work includes adding support for DDL statements and global transaction IDs.
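The reversal described above (flip the event type, swap the before and after images, and replay in reverse order) can be sketched in Python as follows. This is not mysqlbinlog code: the event dictionaries with `type`, `before`, and `after` keys, the `id` primary key, and both function names are assumptions for illustration. The sketch relies on full row images being logged, as the abstract states.

```python
def reverse_events(events):
    """Build the undo stream: newest event first, each one inverted."""
    undo = []
    for event in reversed(events):
        kind = event["type"]
        if kind == "insert":
            undo.append({"type": "delete", "before": event["after"]})
        elif kind == "delete":
            undo.append({"type": "insert", "after": event["before"]})
        elif kind == "update":
            # Swap the images: the old "WHERE" side becomes the new "SET" side.
            undo.append({"type": "update",
                         "before": event["after"],
                         "after": event["before"]})
        else:
            raise ValueError(f"unsupported event type: {kind}")
    return undo


def apply_events(table, events):
    """Apply row events to a dict keyed by the hypothetical primary key 'id'."""
    for event in events:
        if event["type"] == "insert":
            table[event["after"]["id"]] = dict(event["after"])
        elif event["type"] == "delete":
            del table[event["before"]["id"]]
        else:  # update
            del table[event["before"]["id"]]
            table[event["after"]["id"]] = dict(event["after"])
    return table
```

Applying a batch of events and then its reversed stream should leave the table exactly as it started, which is the property a flashback relies on. DDL has no before image to swap, which is consistent with the abstract listing it as future work.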
The document discusses maintaining a dynamic dictionary on disk and describes the state of the art which includes B-trees and several variants. It then introduces the fractal tree, which is a replacement for traditional B-trees that can perform high entropy inserts and deletes up to 100 times faster without suffering from aging effects on range queries. Experimental results show the fractal tree implemented in TokuDB provides 10-100x faster index inserts and faster queries compared to traditional B-trees.
A binary graphics recognition algorithm based on fitting function
Lixun Peng
This document proposes a binary graphics recognition algorithm based on fitting functions. It involves fitting line segments from graphics with polynomial functions, and comparing the fitting functions to templates in a knowledge base to identify the graphics. The algorithm represents graphics as sets of line segments, feature points, and best fitting vectors. It then uses the analysis of variance of fitting functions to recognize graphics by finding the most similar template function.