*If the slides do not display properly, please download the file.*
MariaDB Optimization
- Full table scans
- ORDER BY processing (Using filesort)
- GROUP BY processing
- DISTINCT processing
- Temporary tables (Using temporary)
- Index Condition Pushdown (ICP)
- Multi Range Read
- Index merge
- Table joins
- Subqueries
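To make the Index Condition Pushdown item above concrete, here is a toy Python model of the idea (the table, column names, and data are invented; this is not MariaDB code): a composite index on (last_name, first_name) stores both columns, so a predicate on first_name can be evaluated inside the index before any base-table row is fetched.

```python
# Toy illustration of Index Condition Pushdown (ICP); all data is made up.
# Simulated base table: row id -> (last_name, first_name, salary)
ROWS = [
    ("kim", "ana", 10), ("kim", "bob", 20), ("kim", "mia", 30),
    ("lee", "ada", 40), ("lee", "zoe", 50),
]
# Secondary index on (last_name, first_name): entries carry both key columns.
INDEX = [(ln, fn, rid) for rid, (ln, fn, _) in enumerate(ROWS)]

def scan(last_name, fn_pred, icp):
    """Return (matching rows, number of base-table fetches)."""
    fetches, out = 0, []
    for ln, fn, rid in INDEX:
        if ln != last_name:
            continue
        if icp and not fn_pred(fn):   # ICP: filter on index columns first
            continue
        fetches += 1                  # base-table row lookup
        row = ROWS[rid]
        if not icp and not fn_pred(row[1]):
            continue
        out.append(row)
    return out, fetches

pred = lambda fn: "a" in fn
no_icp = scan("kim", pred, icp=False)   # fetches every index match
with_icp = scan("kim", pred, icp=True)  # fetches only rows that pass the pushed condition
```

Both scans return the same rows, but the ICP variant performs fewer row fetches, which is the whole point of the optimization.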
This document provides an overview of Apache Flink internals. It begins with an introduction and recap of Flink programming concepts. It then discusses how Flink programs are compiled into execution plans and executed in a pipelined fashion, as opposed to being executed eagerly like regular code. The document outlines Flink's architecture including the optimizer, runtime environment, and data storage integrations. It also covers iterative processing and how Flink handles iterations both by unrolling loops and with native iterative datasets.
"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
The document provides an overview of the Google Cloud Platform (GCP) Data Engineer certification exam, including the content breakdown and question format. It then details several big data technologies in the GCP ecosystem such as Apache Pig, Hive, Spark, and Beam. Finally, it covers various GCP storage options including Cloud Storage, Cloud SQL, Datastore, BigTable, and BigQuery, outlining their key features, performance characteristics, data models, and use cases.
This presentation examines the main building blocks for a big data pipeline in the enterprise, drawing inspiration from some of the top big data pipelines in the world, such as those built by Netflix, LinkedIn, Spotify, and Goldman Sachs.
Netflix’s architecture involves thousands of microservices built to serve unique business needs. As this architecture grew, it became clear that the data storage and query needs were unique to each area; there is no one silver bullet which fits the data needs for all microservices. CDE (Cloud Database Engineering team) offers polyglot persistence, which promises to offer ideal matches between problem spaces and persistence solutions. In this meetup you will get a deep dive into the Self service platform, our solution to repairing Cassandra data reliably across different datacenters, Memcached Flash and cross region replication and Graph database evolution at Netflix.
PGConf APAC 2018 - PostgreSQL HA with Pgpool-II and what's been happening in P... (PGConf APAC)
Speaker: Muhammad Usama
Pgpool-II has complemented PostgreSQL for over a decade, providing features such as connection pooling, failover, query caching, load balancing, and HA. High availability (HA) is critical to most enterprise applications: clients need the ability to automatically reconnect to a secondary node when the master node goes down.
This is where the Pgpool-II watchdog comes in: the watchdog is the core Pgpool-II feature that provides HA by eliminating the single point of failure. The feature has been around for a while, but it went through a major overhaul and enhancements in recent releases. This talk explains the watchdog feature and its recent enhancements, and describes how it can be used to provide PostgreSQL HA and automatic failover.
There is a rising trend of enterprise deployments shifting to cloud-based environments, and Pgpool-II can be used in the cloud without any issues. The talk also gives some ideas on how Pgpool-II is used to provide PostgreSQL HA in cloud environments.
Finally, we summarise the major features added in the most recent major release of Pgpool-II and what is in the pipeline for the next one.
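As a rough idea of what enabling the watchdog involves, the fragment below sketches the relevant pgpool.conf settings on one of two Pgpool-II nodes. Parameter names follow the Pgpool-II 3.x/4.x documentation, but the host names and addresses are placeholders; consult the reference for your version before relying on any of them.

```ini
# pgpool.conf (node 1) - illustrative values only
use_watchdog = on
wd_hostname = 'pgpool1.example.com'
wd_port = 9000
delegate_IP = '192.0.2.100'                    # virtual IP that clients connect to
heartbeat_destination0 = 'pgpool2.example.com' # peer to exchange heartbeats with
other_pgpool_hostname0 = 'pgpool2.example.com'
other_pgpool_port0 = 9999
other_wd_port0 = 9000
```

When the node holding the virtual IP fails, the surviving watchdog takes over the delegate IP, so clients can reconnect without changing their connection strings.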
Tame the small files problem and optimize data layout for streaming ingestion... (Flink Forward)
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: (1) a small-files problem that can hurt read performance, and (2) poor data clustering that can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning, which reduces the number of concurrent files that each task writes and improves data clustering. In this talk, we will explain the motivation in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
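The bin-packing grouping mentioned in the abstract can be sketched with the classic first-fit-decreasing heuristic: records (or record batches) with known sizes are packed into as few fixed-capacity "files" as possible. The sizes and capacity below are invented for illustration; this is not the Flink/Iceberg writer's actual code.

```python
# First-fit-decreasing bin packing, as one way a shuffling stage could group
# data destined for output files; sizes and capacity are illustrative.
def pack(sizes, capacity):
    """Greedily pack item sizes into the fewest bins of the given capacity."""
    bins = []  # each bin is [remaining_capacity, [item sizes]]
    for s in sorted(sizes, reverse=True):   # largest items first
        for b in bins:
            if b[0] >= s:                   # first bin with room wins
                b[0] -= s
                b[1].append(s)
                break
        else:
            bins.append([capacity - s, [s]])  # open a new bin
    return [b[1] for b in bins]

groups = pack([7, 5, 4, 3, 1], capacity=8)
# Five items collapse into three "files" instead of five small ones.
```

Fewer, fuller output files is exactly the property that mitigates the small-files problem on the read path.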
In this presentation I illustrate how and why InnoDB performs page merges and splits, and show what can be done to reduce their impact.
Pegasus: Designing a Distributed Key Value System (Arch Summit Beijing 2016) (涛 吴)
These slides, delivered by Zuoyan Qin, chief engineer of the XiaoMi Cloud Storage Team, were for a talk at Arch Summit Beijing 2016 on how Pegasus was designed.
This document summarizes information about state transfers in Galera, an open source replication plugin for MySQL. It discusses when state transfers are needed, such as during network partitions or node failures. It describes the two types of state transfers - incremental state transfers (IST), which sync nodes incrementally, and state snapshot transfers (SST), which transfer a full data copy. The document compares different methods for SST, such as mysqldump, RSYNC, Xtrabackup, and Clone, and discusses their speeds, impact on donor nodes, and other characteristics. Overall, the document provides an overview of state transfers in Galera synchronous replication.
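For orientation, the SST method comparison above maps to a couple of Galera settings in my.cnf. The fragment below is illustrative only; parameter names are from the Galera/MariaDB documentation, but available SST methods differ by distribution and version (e.g. mariabackup on MariaDB), so verify against your own docs.

```ini
# my.cnf fragment - illustrative; check names against your Galera version
[mysqld]
wsrep_on = ON
wsrep_provider = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_sst_method = mariabackup   # alternatives include rsync, mysqldump
wsrep_sst_donor = node2          # pin a donor to confine the SST impact
```

Pinning a donor is one practical way to control which node absorbs the performance hit of a full state snapshot transfer.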
In 40 minutes the audience will learn a variety of ways to make a PostgreSQL database suddenly go out of memory on a box with half a terabyte of RAM.
Developers' and DBAs' best practices for preventing this will also be discussed, along with a bit of Postgres and Linux memory-management internals.
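One classic way to exhaust even half a terabyte is that `work_mem` is a per-sort/per-hash allowance, not a per-server cap, so it multiplies across connections and plan nodes. The back-of-the-envelope sketch below uses invented numbers purely to show the arithmetic:

```python
# Back-of-the-envelope Postgres memory budget; all numbers are illustrative.
GB, MB = 1024**3, 1024**2

ram = 512 * GB
shared_buffers = 128 * GB       # fixed shared memory
work_mem = 256 * MB             # allowance per sort/hash node, per backend
connections = 500
sort_nodes_per_query = 4        # one query can use work_mem several times over

worst_case_private = connections * sort_nodes_per_query * work_mem  # 500 GB
total = shared_buffers + worst_case_private
overcommitted = total > ram     # True: the box can be driven out of memory
```

The point is that a seemingly modest `work_mem` becomes hundreds of gigabytes in the worst case once concurrency and plan complexity are factored in.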
This document discusses best practices for using PySpark. It covers:
- Core concepts of PySpark including RDDs and the execution model. Functions are serialized and sent to worker nodes using pickle.
- Recommended project structure with modules for data I/O, feature engineering, and modeling.
- Writing testable, serializable code with static methods and avoiding non-serializable objects like database connections.
- Tips for testing like unit testing functions and integration testing the full workflow.
- Best practices for running jobs like configuring the Python environment, managing dependencies, and logging to debug issues.
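The "testable, serializable code" point above can be sketched in plain Python (the function name and data format are made up): keep transformation logic in pure, top-level functions, which are picklable the way Spark requires and unit-testable without a cluster, and avoid closing over non-serializable objects such as database connections.

```python
import pickle

def parse_amount(line):
    """Pure, top-level function: picklable and unit-testable without Spark."""
    user, amount = line.split(",")
    return user, float(amount)

# A Spark job would use this as rdd.map(parse_amount); here we just verify
# that it round-trips through pickle (as Spark would need) and behaves correctly.
restored = pickle.loads(pickle.dumps(parse_amount))
assert restored("alice,3.5") == ("alice", 3.5)
```

By contrast, a lambda capturing a connection object would fail to pickle, which is exactly the class of bug this structure avoids.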
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto... (Ceph Community)
This document discusses Dell's support for CEPH storage solutions and provides an agenda for a CEPH Day event at Dell. Key points include:
- Dell is a certified reseller of Red Hat-Inktank CEPH support, services, and training.
- The agenda covers why Dell supports CEPH, hardware recommendations, best practices shared with CEPH colleagues, and a concept for research data storage that is seeking input.
- Recommended CEPH architectures, components, configurations, and considerations are discussed for planning and implementing a CEPH solution. Dell server hardware options that could be used are also presented.
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
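To make the change-event idea tangible, here is a toy apply loop over (operation, key, value) events. The event shape is invented for illustration and is not the actual Delta CDF schema; in Delta itself, the feed is enabled via the table property `delta.enableChangeDataFeed = true` and rows carry change-type metadata columns.

```python
# Toy change-feed apply; the event tuples are invented, not Delta's CDF schema.
def apply_changes(target, events):
    """Apply insert/update/delete change events to a dict acting as the target table."""
    for op, key, value in events:
        if op in ("insert", "update"):
            target[key] = value        # upsert the new row image
        elif op == "delete":
            target.pop(key, None)      # remove the row if present
    return target

state = apply_changes({}, [
    ("insert", 1, "a"), ("insert", 2, "b"),
    ("update", 2, "B"), ("delete", 1, None),
])
# state is {2: "B"}: only the surviving, latest row images remain
```

Replaying such a feed downstream is what lets an ETL pipeline stay incremental instead of re-reading the whole source table.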
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... (Databricks)
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
This document discusses streaming replication in PostgreSQL. It covers how streaming replication works, including the write-ahead log and replication processes. It also discusses setting up replication between a primary and standby server, including configuring the servers and verifying replication is working properly. Monitoring replication is discussed along with views and functions for checking replication status. Maintenance tasks like adding or removing standbys and pausing replication are also mentioned.
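The primary/standby setup described above boils down to a handful of settings. The fragment below is a minimal sketch for PostgreSQL 12 or later (where `standby.signal` replaced `recovery.conf`); host names, the replication role, and the password are placeholders.

```ini
# postgresql.conf on the primary (PostgreSQL 12+; values illustrative)
wal_level = replica
max_wal_senders = 10

# pg_hba.conf on the primary: allow the standby's replication user, e.g.
#   host  replication  repl  192.0.2.10/32  scram-sha-256

# postgresql.conf on the standby (plus an empty standby.signal file in its data dir)
primary_conninfo = 'host=primary.example.com user=repl password=secret'
```

With this in place, replication status can be checked on the primary via the `pg_stat_replication` view, which the document's monitoring section refers to.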
Oracle Warehouse Builder is Oracle's tool for designing, deploying, and managing business intelligence and data integration projects on the Oracle database. It provides a graphical environment to extract, transform, and load data from various sources into an Oracle data warehouse or data marts. Warehouse Builder manages the full lifecycle of metadata and data, and enables users to design and deploy ETL processes and reporting infrastructure and to manage the target schema.
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe... (confluent)
Do you ever feel that your stream processor gets in the way of expressing business requirements? Most processors are frameworks, which are highly opinionated in the design and implementation of apps. Performing Complex Event Processing invariably leads to calling out to other technologies, but what if that integration didn’t require an RPC call or could be modeled into your stream itself? This talk will explore how to build rich domain, low latency, back-pressured, and stateful streaming applications that require very little infrastructure, using Akka Streams and the Alpakka Kafka connector.
We will explore how Alpakka Kafka maps to Kafka features in order to provide a comprehensive understanding of how to build a robust streaming platform. We’ll explore transactional message delivery, defensive consumer group rebalancing, stateful stages, and state durability/persistence. Akka Streams is built on top of Akka, an asynchronous messaging-driven middleware toolkit that can be used to build Erlang-like Actor Systems in Java or Scala. It is used as a JVM library to facilitate common streaming semantics within an existing or standalone application. It’s different from other stream processors in several ways. It natively supports back-pressure flow control inside a single JVM instance or across distributed systems to help prevent overloading downstream infrastructure. It’s perfect for modeling Complex Event Processing with its easy integration into existing apps and Akka Actor systems. Also, unlike most acyclic stream processors, Akka Streams can support sophisticated pipelines, or Graphs, by allowing the user to model cycles (loops) when there’s a need.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Anti-entropy repairs are known to be a very peculiar maintenance operation of Cassandra clusters. They are problematic mostly because of the potential of having negative impact on the cluster's performance. Another problematic aspect is the difficulty of managing the repairs of Cassandra clusters in a careful way that would prevent the negative performance impact.
Based on the long-term pain we have been experiencing with managing repairs of nearly 100 Cassandra clusters, and being unable to find a solution that would meet our needs, we went ahead and developed an open-source tool, named Cassandra Reaper [1], for easy management of Cassandra repairs.
Cassandra Reaper is a tool that automates the management of anti-entropy repairs of Cassandra clusters in a rather smart, efficient and careful manner while requiring minimal Cassandra expertise.
I will have to cover some basics of eventual consistency mechanisms of Cassandra, after which I will be able to focus on the features of Cassandra Reaper and our six months of experience having the tool managing the repairs of our production clusters.
The document discusses using plProxy and pgBouncer to split a PostgreSQL database horizontally and vertically to improve scalability. It describes how plProxy allows functions to make remote calls to other databases and how pgBouncer can be used for connection pooling. The RUN ON clause of plProxy is also summarized, which allows queries to execute on all partitions or on a specific partition.
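The effect of plProxy's `RUN ON hashtext(key)` clause can be modeled in a few lines: a key is hashed and the result selects one of N partitions (plProxy requires the partition count to be a power of two). The hash below is CRC32 as a stand-in, not PostgreSQL's `hashtext()`, and the partition names are invented.

```python
import zlib

PARTITIONS = ["part0", "part1", "part2", "part3"]  # power of two, as plProxy requires

def run_on(username):
    """Route a call to one partition, mimicking plProxy's RUN ON hashtext(username)."""
    h = zlib.crc32(username.encode())        # stand-in for PostgreSQL's hashtext()
    return PARTITIONS[h & (len(PARTITIONS) - 1)]

# The same key always routes to the same partition, so related data stays together;
# RUN ON ALL, by contrast, would execute the call on every partition.
```

This determinism is what makes function-level sharding transparent to the caller.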
The talk by Maksud Ibrahimov, Chief Data Scientist at InfoReady Analytics. He is going to share with us how to maximise the performance of Spark.
As a user of Apache Spark from very early releases, he generally sees that the framework is easy to start with but as the program grows its performance starts to suffer. In this talk Maksud will answer the following questions:
- How to reach a higher level of parallelism in your jobs without scaling up your cluster?
- Understanding shuffles, and how to avoid disk spills
- How to identify task stragglers and data skews?
- How to identify Spark bottlenecks?
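One simple way to attack the straggler/skew question in the list above is to compare per-task durations against their median, a common heuristic (the durations below are invented; in practice they would come from the Spark UI or event logs):

```python
from statistics import median

def stragglers(durations, factor=3.0):
    """Flag task durations more than `factor` times the median of the batch."""
    m = median(durations)
    return [d for d in durations if d > factor * m]

task_secs = [12, 11, 13, 12, 95, 12]   # one task far slower: likely a skewed partition
slow = stragglers(task_secs)           # flags the 95-second task
```

A persistent straggler on the same partition key usually points to data skew rather than a slow executor.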
Writing Continuous Applications with Structured Streaming PySpark API (Databricks)
"We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speaker: Jules Damji
In this presentation I am illustrating how and why InnodDB perform Merge and Split pages. I will also show what are the possible things to do to reduce the impact.
Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)涛 吴
This slide delivered by Zuoyan Qin, Chief engineer from XiaoMi Cloud Storage Team, was for talk at Arch summit Beijing-2016 regarding how Pegasus was designed.
This document summarizes information about state transfers in Galera, an open source replication plugin for MySQL. It discusses when state transfers are needed, such as during network partitions or node failures. It describes the two types of state transfers - incremental state transfers (IST), which sync nodes incrementally, and state snapshot transfers (SST), which transfer a full data copy. The document compares different methods for SST, such as mysqldump, RSYNC, Xtrabackup, and Clone, and discusses their speeds, impact on donor nodes, and other characteristics. Overall, the document provides an overview of state transfers in Galera synchronous replication.
In 40 minutes the audience will learn a variety of ways to make postgresql database suddenly go out of memory on a box with half a terabyte of RAM.
Developer's and DBA's best practices for preventing this will also be discussed, as well as a bit of Postgres and Linux memory management internals.
This document discusses best practices for using PySpark. It covers:
- Core concepts of PySpark including RDDs and the execution model. Functions are serialized and sent to worker nodes using pickle.
- Recommended project structure with modules for data I/O, feature engineering, and modeling.
- Writing testable, serializable code with static methods and avoiding non-serializable objects like database connections.
- Tips for testing like unit testing functions and integration testing the full workflow.
- Best practices for running jobs like configuring the Python environment, managing dependencies, and logging to debug issues.
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Community
This document discusses Dell's support for CEPH storage solutions and provides an agenda for a CEPH Day event at Dell. Key points include:
- Dell is a certified reseller of Red Hat-Inktank CEPH support, services, and training.
- The agenda covers why Dell supports CEPH, hardware recommendations, best practices shared with CEPH colleagues, and a concept for research data storage that is seeking input.
- Recommended CEPH architectures, components, configurations, and considerations are discussed for planning and implementing a CEPH solution. Dell server hardware options that could be used are also presented.
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide some performance tuning and testing tips for your Spark applications
This document discusses streaming replication in PostgreSQL. It covers how streaming replication works, including the write-ahead log and replication processes. It also discusses setting up replication between a primary and standby server, including configuring the servers and verifying replication is working properly. Monitoring replication is discussed along with views and functions for checking replication status. Maintenance tasks like adding or removing standbys and pausing replication are also mentioned.
Oracle Warehouse Builder is Oracle's tool for designing, deploying, and managing business intelligence and data integration projects on the Oracle database. It provides a graphical environment to extract, transform, and load data from various sources into a Oracle data warehouse or datamarts. Warehouse Builder manages the full lifecycle of metadata and data, and enables users to design and deploy ETL processes, reporting infrastructure, and manage the target schema.
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...confluent
Do you ever feel that your stream processor gets in the way of expressing business requirements? Most processors are frameworks, which are highly opinionated in the design and implementation of apps. Performing Complex Event Processing invariably leads to calling out to other technologies, but what if that integration didn’t require an RPC call or could be modeled into your stream itself? This talk will explore how to build rich domain, low latency, back-pressured, and stateful streaming applications that require very little infrastructure, using Akka Streams and the Alpakka Kafka connector.
We will explore how Alpakka Kafka maps to Kafka features in order to provide a comprehensive understanding of how to build a robust streaming platform. We’ll explore transactional message delivery, defensive consumer group rebalancing, stateful stages, and state durability/persistence. Akka Streams is built on top of Akka, an asynchronous messaging-driven middleware toolkit that can be used to build Erlang-like Actor Systems in Java or Scala. It is used as a JVM library to facilitate common streaming semantics within an existing or standalone application. It’s different from other stream processors in several ways. It natively supports back-pressure flow control inside a single JVM instance or across distributed systems to help prevent overloading downstream infrastructure. It’s perfect for modeling Complex Event Processing with its easy integration into existing apps and Akka Actor systems. Also, unlike most acyclic stream processors, Akka Streams can support sophisticated pipelines, or Graphs, by allowing the user to model cycles (loops) when there’s a need.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Anti-entropy repairs are known to be a very peculiar maintenance operation of Cassandra clusters. They are problematic mostly because of the potential of having negative impact on the cluster's performance. Another problematic aspect is the difficulty of managing the repairs of Cassandra clusters in a careful way that would prevent the negative performance impact.
Based on the long-term pain we have been experiencing with managing repairs of nearly 100 Cassandra clusters, and being unable to find a solution that would meet our needs, we went ahead and developed an open-source tool, named Cassandra Reaper [1], for easy management of Cassandra repairs.
Cassandra Reaper is a tool that automates the management of anti-entropy repairs of Cassandra clusters in a rather smart, efficient and careful manner while requiring minimal Cassandra expertise.
I will have to cover some basics of eventual consistency mechanisms of Cassandra, after which I will be able to focus on the features of Cassandra Reaper and our six months of experience having the tool managing the repairs of our production clusters.
The document discusses using plProxy and pgBouncer to split a PostgreSQL database horizontally and vertically to improve scalability. It describes how plProxy allows functions to make remote calls to other databases and how pgBouncer can be used for connection pooling. The RUN ON clause of plProxy is also summarized, which allows queries to execute on all partitions or on a specific partition.
The talk by Maksud Ibrahimov, Chief Data Scientist at InfoReady Analytics. He is going to share with us how to maximise the performance of Spark.
As a user of Apache Spark from very early releases, he generally sees that the framework is easy to start with but as the program grows its performance starts to suffer. In this talk Maksud will answer the following questions:
- How to reach higher level of parallelism of your jobs without scaling up your cluster?
- Understanding shuffles, and how to avoid disk spills
- How to identify task stragglers and data skews?
- How to identify Spark bottlenecks?
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
"We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be an instructor-led, hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speaker: Jules Damji
Provides not only an H/A cluster using shared storage
but also a shared-nothing H/A cluster using replication
Built-in application-aware high-availability features
In-depth monitoring that double-checks the database itself
Support for 30 major applications
Building a High-Performance SQL Database on AWS for Game Services (Lee Jeong-hoon, Solutions Architect, AWS) :: Gaming on AWS 2018 - Amazon Web Services Korea
In a game service architecture the relational database is a core component, and it often becomes the performance bottleneck of the entire service. This session explains configurations that achieve performance equal to or higher than that of an on-premises DB when building game services on AWS, and includes a performance demo of an MS SQL configuration.
This document provides recommendations for system capacity planning for an Oracle database:
- Plan for 1 CPU per 200 concurrent users, and prefer more medium-speed CPUs over fewer, faster CPUs.
- Reserve 10% of memory for the operating system and allocate 220 MB for the Oracle SGA and 3 MB per user process.
- Use striped and mirrored or striped with parity RAID for disks. Consider raw devices or SANs if possible.
- Ensure the network capacity is adequate based on site size.
Part 1 ORACLE │161
Utilizing the KEEP Buffer
EXEM Co., Ltd. Consulting Division / DB Consulting Team, Jang Jeongmin
Overview
Oracle uses a Buffer Cache to process user requests quickly. The Buffer Cache resides in the SGA and is shared by every process that connects to the Oracle instance. It is the core of Oracle I/O management: by keeping the blocks of frequently used data files resident in memory, it reduces physical I/O operations. Using the Buffer Cache effectively reduces physical I/O and naturally resolves I/O performance problems.
As Oracle versions advanced, the algorithms handling the Buffer Cache were continually improved and new management methods were added. Up to Oracle 7 the Buffer Cache operated as a single pool, which made it difficult to use the cache differently according to each object's characteristics. To address this, Oracle 8 introduced the Multiple Buffer Pools feature, which allows the Buffer Cache to be managed in a more fine-grained way, taking each object's characteristics and access frequency into account.
The Keep Buffer is one of the areas that make up the Multiple Buffer Pools.
Purpose and Characteristics of the KEEP Buffer
The purpose of the Keep Buffer, like that of the Buffer Cache itself, is to avoid physical I/O by keeping objects resident in memory.
The Keep Buffer Pool and the Default Buffer Pool do not use different algorithms to cache or flush data blocks. The reason the single Buffer Cache area was nonetheless divided into Multiple Buffer Pools is to let the cache be used according to each object's characteristics.
162│2013 Technical White Paper
The KEEP Buffer's memory space is managed sequentially. A segment, once KEEPed, stays in memory until the KEEP Buffer space has been fully allocated; after that, the oldest blocks are pushed out to the default pool. The sizes of the segments to be KEEPed must therefore be calculated accurately, and an appropriate KEEP Buffer size allocated.
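As a rough sizing aid, the candidate segments' total size can be summed before setting db_keep_cache_size. A minimal sketch, in which the owner and segment names are hypothetical placeholders:

```sql
-- Estimate how much KEEP buffer the candidate segments would need (MB).
SELECT ROUND( SUM( bytes ) / 1024 / 1024 ) AS candidate_mb
FROM   dba_segments
WHERE  ( owner, segment_name ) IN ( ( 'SCOTT', 'T1' ), ( 'SCOTT', 'T1_PK' ) );
```

The result gives a lower bound for db_keep_cache_size; leave headroom if any candidate segment is still growing.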
KEEP Buffer Usage Procedure
The Keep Buffer is used in the following order:
1. Select KEEP candidates
2. Check OS memory & SGA space
3. Configure the KEEP Buffer
4. Change table/index attributes
5. KEEP the tables/indexes
6. Check KEEP effectiveness
7. Review KEEP candidate selection criteria
Selecting KEEP Candidates
There is no clear-cut criterion for selecting KEEP candidates; they must be chosen to suit each database's operating environment, with the actual business workload in mind. If the blocks to be KEEPed are used very frequently, they are likely to be cached even with the Default Buffer alone, in which case the Keep Buffer brings no performance benefit. Conversely, keeping rarely used blocks resident in the Keep Buffer holds memory that is not being used, which can degrade overall performance.
Consideration of the actual workload is therefore essential when selecting KEEP candidates. Examples include a program that runs only once a day but whose elapsed time must be shortened at all costs, or a program that must run during busy hours but causes a system-wide bottleneck because of heavy I/O. The objects such programs use can be KEEP candidates. However, not every segment used by such programs can be kept resident in the KEEP Buffer, so the candidates must be selected in light of each database's operating environment. Beyond this, a segment's size, DML frequency, and data access frequency also determine whether it is suitable for KEEPing.
Criterion [1]: Program importance
The most important factor in selecting a KEEP candidate is the importance of the business program that queries the segment. If that program is not important, there is no need to use the Keep Buffer. Conversely, if the program querying the segment is highly important and its elapsed time must be shortened by any means, the segment can be considered for KEEPing regardless of how often the program runs.
Criterion [2]: Segment size
A segment whose size is unstable and grows excessively can reduce the effectiveness of the Keep Buffer. When the Keep Buffer runs short of space, the oldest blocks of KEEPed segments are pushed out to the Default Buffer, so a continuously growing segment in the Keep Buffer can degrade the performance of programs that query other segments. It is therefore advisable to select segments whose size is constant, or fluctuates little and stays below a certain maximum. For example, a criterion such as 'segments no larger than 100,000 blocks' can be set.
Criterion [3]: Full Table Scan & Index Full Scan & Index Fast Full Scan
The benefit of a KEEPed segment is greatest when queries against it must process a fairly large amount of data. Segments accessed through inefficient wide-range index scans, Full Table Scans, or Index Fast Full Scans can therefore be candidates.
KEEP candidate selection SQL script
SELECT owner ,
table_name ,
index_name ,
partition_name ,
SUM( blocks ) AS t_blocks
FROM (
SELECT sg.owner ,
decode( SUBSTR( s.ob_type , 1 , 5 ) , 'TABLE' , s.ob_name , 'INDEX' , (
SELECT table_name
FROM dba_indexes
WHERE index_name = s.ob_name
) ) AS table_name ,
decode( SUBSTR( s.ob_type , 1 , 5 ) , 'INDEX' , s.ob_name ) AS index_name ,
sg.partition_name ,
sg.blocks
FROM (
SELECT DISTINCT object_name AS ob_name ,
object_type AS ob_type
FROM v$sql_plan
WHERE ( operation = 'TABLE ACCESS'
AND options = 'FULL' )
OR ( operation = 'INDEX'
AND options = 'FULL SCAN' )
OR ( operation = 'INDEX'
AND options = 'FAST FULL SCAN' ) --> criterion [3]
) s ,
dba_segments sg
WHERE s.ob_name = sg.segment_name
)
GROUP BY owner ,
table_name ,
index_name ,
partition_name
HAVING SUM( blocks ) > 100000 --> criterion [2]
Checking OS Memory & SGA Space
OS memory check script
$ cat /proc/meminfo
MemTotal: 4055152 kB
MemFree: 1390308 kB
Buffers: 166768 kB
Cached: 2019992 kB
SwapCached: 0 kB
Active: 1118484 kB
Inactive: 1277864 kB
………
SGA space check SQL script
● Check total SGA size
SELECT name ,
ROUND( bytes/1024/1024 ) "size(MB)"
FROM V$SGAINFO;
NAME size(MB)
------------------------------- ----------
Fixed SGA Size 2
Redo Buffers 5
Buffer Cache Size 48
Shared Pool Size 128
Large Pool Size 0
Java Pool Size 24
Streams Pool Size 0
Shared IO Pool Size 0
Granule Size 4
Maximum SGA Size 207
Startup overhead in Shared Pool 72
Free SGA Memory Available 0
● Check data buffer size
SELECT name ,
current_size
FROM v$buffer_pool;
NAME CURRENT_SIZE
-------------------- ------------
DEFAULT 48
KEEP BUFFER Configuration
KEEP Buffer configuration is done either online or offline, depending on the KEEP Buffer size and the free space in the SGA. The scripts in this document assume manual memory management of the SGA.
KEEP Buffer configuration script
@sga
NAME size(MB)
-------------------------------- ----------
Buffer Cache Size 500
Maximum SGA Size 1019
Free SGA Memory Available 228
@bc
NAME CURRENT_SIZE
-------------------- ------------
DEFAULT 500
● Online, when the KEEP Buffer is smaller than the free SGA space ( KEEP Buffer 100M )
SQL> alter system set db_keep_cache_size = 100M scope = both;
System altered.
SQL> @sga
NAME size(MB)
-------------------------------- ----------
Buffer Cache Size 600
Maximum SGA Size 1019
Free SGA Memory Available 128
SQL> @bc
NAME CURRENT_SIZE
-------------------- ------------
KEEP 100
DEFAULT 500
● When the KEEP Buffer is larger than the free SGA space ( KEEP Buffer 300M )
1. Increase the total SGA size, then allocate the KEEP Buffer: offline work required
SQL> alter system set sga_max_size = 1100M scope = spfile;
System altered.
SQL> alter system set db_keep_cache_size = 300M scope = spfile;
System altered.
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area 1169227776 bytes
Fixed Size 2212696 bytes
Variable Size 301993128 bytes
Database Buffers 855638016 bytes
Redo Buffers 9383936 bytes
Database mounted.
Database opened.
SQL> @sga
NAME size(MB)
-------------------------------- ----------
Buffer Cache Size 816
Maximum SGA Size 1115
Free SGA Memory Available 0
SQL> @bc
NAME CURRENT_SIZE
-------------------- ------------
KEEP 304
DEFAULT 512
=> Because the total SGA size exceeded 1G, the granule size grew to 16 MB;
the value allocated is then the smallest multiple of 16 larger than the value specified.
2. Shrink other SGA areas, then allocate the KEEP Buffer: can be done online
Script to list parameters modifiable online
SELECT name ,
issys_modifiable
FROM v$parameter
WHERE name LIKE '%size%'
AND issys_modifiable = 'IMMEDIATE'
NAME ISSYS_MOD
--------------------------------------- ---------
shared_pool_size IMMEDIATE
large_pool_size IMMEDIATE
java_pool_size IMMEDIATE
streams_pool_size IMMEDIATE
db_cache_size IMMEDIATE
db_2k_cache_size IMMEDIATE
db_4k_cache_size IMMEDIATE
db_8k_cache_size IMMEDIATE
db_16k_cache_size IMMEDIATE
db_32k_cache_size IMMEDIATE
db_keep_cache_size IMMEDIATE
db_recycle_cache_size IMMEDIATE
_shared_io_pool_size IMMEDIATE
db_flash_cache_size IMMEDIATE
db_recovery_file_dest_size IMMEDIATE
result_cache_max_size IMMEDIATE
workarea_size_policy IMMEDIATE
max_dump_file_size IMMEDIATE
=> Adjust these parameter values appropriately to secure free memory, then configure KEEP
Changing Table/Index Attributes
Table/index attribute change script
ALTER TABLE T1 STORAGE (BUFFER_POOL KEEP); -- change table attribute
ALTER INDEX T1_PK STORAGE (BUFFER_POOL KEEP); -- change index attribute
ALTER TABLE P1 MODIFY PARTITION P1_1 STORAGE (BUFFER_POOL KEEP); -- change partitioned table attribute
ALTER INDEX P1_ID1 MODIFY PARTITION P1_ID1_1 STORAGE (BUFFER_POOL KEEP); -- change partitioned index attribute
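The change can be verified by checking the buffer_pool column of dba_segments; this check reuses the T1/T1_PK names from the script above:

```sql
-- Segments whose buffer pool attribute was changed should show 'KEEP'.
SELECT segment_name, segment_type, buffer_pool
FROM   dba_segments
WHERE  segment_name IN ( 'T1', 'T1_PK' );
```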
KEEPing Tables/Indexes
Tables and indexes whose segment buffer pool is set to KEEP load their blocks into the KEEP Buffer when queried, so disk I/O occurs when a segment is loaded for the first time. If you want to eliminate disk I/O from every application query, including the very first one, load the segments into the KEEP Buffer with a Full Table Scan or Index Fast Full Scan before business processing begins.
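A minimal pre-loading sketch under the same hypothetical T1/T1_PK names; note that on some Oracle versions serial direct path reads can bypass the buffer cache, so verify the result with v$buffer_pool_statistics after loading:

```sql
-- Touch every block so it is read into the KEEP buffer before business hours.
SELECT /*+ FULL(t) */ COUNT(*) FROM t1 t;             -- full table scan
SELECT /*+ INDEX_FFS(t t1_pk) */ COUNT(*) FROM t1 t;  -- index fast full scan
```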
Judging KEEP Buffer Effectiveness
There is no single fixed criterion for using the KEEP Buffer; it differs by operating environment, so its effectiveness cannot be judged the same way everywhere. The following data, however, can serve as a basis for the judgment.
KEEP Buffer size & hit ratio SQL script
SELECT current_size keep_size ,
seg_size ,
ROUND( seg_size/current_size*100 , 1 ) "Ratio(%)"
FROM v$buffer_pool ,
(
SELECT SUM( bytes ) /1024 /1024 seg_size
FROM dba_segments
WHERE buffer_pool = 'KEEP'
)
WHERE name = 'KEEP'
KEEP_SIZE SEG_SIZE Ratio(%)
---------- ---------- ----------
304 118 38.8
SELECT db_block_gets ,
consistent_gets ,
physical_reads ,
CASE
WHEN db_block_gets+consistent_gets <> 0
THEN ROUND(( 1-( physical_reads/( db_block_gets+consistent_gets ) ) ) *100 , 2 )
END "Keep_Hit(%)"
FROM v$buffer_pool_statistics
WHERE name = 'KEEP'
DB_BLOCK_GETS CONSISTENT_GETS PHYSICAL_READS Keep_Hit(%)
------------- --------------- -------------- -----------
0 44474 4435 90.03
If the total size of the segments using the KEEP Buffer is smaller than the KEEP Buffer, and those segments grow no further, the cache hit ratio of a segment once loaded into the KEEP area will approach 100%.
If the KEEP Buffer is smaller than the segments that use it, contention arises in the KEEP Buffer area, physical I/O occurs, and the cache hit ratio can drop. In that case, consider enlarging the KEEP Buffer or excluding the less important segments from it. Conversely, if the KEEP Buffer is much larger than the segments, it is occupying unused memory, so consider shrinking it. The KEEP Buffer size therefore needs to be tuned efficiently according to system performance and the importance of the segments.
The following script queries the dba_hist_seg_stat view for AWR information on the I/O generated when each segment is queried, so that per-segment effectiveness can be judged.
Segment I/O SQL script
accept i_begin_time prompt 'Enter begin time[YYYYMMDDHH24]: '
accept i_end_time prompt 'Enter end time[YYYYMMDDHH24]: '
variable v_begin_time char(10)
variable v_end_time char(10)
exec :v_begin_time:=&i_begin_time
exec :v_end_time :=&i_end_time
SELECT /*+ leading(k) */
s.dbid ,
decode( SUBSTR( o.object_type , 1 , 5 ) , 'TABLE' , o.object_name , 'INDEX' , (
SELECT table_name
FROM dba_indexes
WHERE index_name = o.object_name
AND owner = k.owner
) ) AS table_name ,
decode( SUBSTR( o.object_type , 1 , 5 ) , 'INDEX' , o.object_name ) AS index_name ,
s.snap_id ,
TO_CHAR( w.begin_interval_time , 'yyyymmdd.hh24' ) AS begin_time ,
s.physical_reads_delta ,
s.physical_reads_direct_delta ,
s.physical_reads_delta + s.physical_reads_direct_delta AS total_diskio
FROM sys.wrm$_snapshot w ,
dba_hist_seg_stat s ,
dba_objects o ,
(
SELECT owner ,
segment_name
FROM dba_segments
WHERE buffer_pool = 'KEEP'
) k
WHERE w.begin_interval_time >= to_timestamp( :v_begin_time , 'yyyymmddhh24' )
AND w.end_interval_time <= to_timestamp( :v_end_time , 'yyyymmddhh24' )
AND w.snap_id = s.snap_id
AND w.dbid = s.dbid
AND w.instance_number = s.instance_number
AND s.obj# = o.object_id
AND k.segment_name = o.object_name
AND k.owner = o.owner
ORDER BY 2 , 3 , 5
Conclusion
The most important thing in using the KEEP Buffer is to reflect the business workload. From a whole-system point of view, it may be more efficient to use a single Buffer Cache in which only frequently used objects reside. Considering the importance and characteristics of individual tasks, however, the answer may differ: some tasks run rarely yet are highly important, and for others shortening the elapsed time matters greatly. If the KEEP Buffer can be used efficiently when building an operating plan that reflects these business characteristics, it will contribute greatly to system performance.