Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H... (Spark Summit)
In Spark SQL’s Catalyst optimizer, many rule-based optimization techniques have been implemented, but the optimizer itself can still be improved. For example, without detailed column statistics on data distribution, it is difficult to accurately estimate the filter factor, cardinality, and thus the output size of a database operator. Inaccurate or misleading statistics often lead the optimizer to choose suboptimal query execution plans.
We added a Cost-Based Optimizer framework to the Spark SQL engine. In our framework, we use the ANALYZE TABLE SQL statement to collect detailed column statistics and save them into Spark’s catalog. For the relevant columns, we collect the number of distinct values, the number of NULL values, the maximum/minimum value, the average/maximal column length, etc. We also save the data distribution of columns in either equal-width or equal-height histograms in order to deal with data skew effectively. Furthermore, with the number of distinct values and the number of records in a table, we can determine how unique a column is, even though Spark SQL does not support primary keys. This helps determine, for example, the output size of a join or a multi-column group-by operation.
In our framework, we compute the cardinality and output size of each database operator. With reliable statistics and derived cardinalities, we are able to make good decisions in these areas: selecting the correct build side of a hash-join operation, choosing the right join type (broadcast hash-join versus shuffled hash-join), adjusting the multi-way join order, etc. In this talk, we will show Spark SQL’s new Cost-Based Optimizer framework and its performance impact on TPC-DS benchmark queries.
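The statistics-driven decisions described above can be sketched in a few lines of Python. This is a hedged illustration only: the function names, the uniform-distribution assumption, and the 10 MB broadcast threshold are inventions for the sketch, not Spark's actual implementation.

```python
# Hypothetical sketch: turn column statistics into cardinality estimates,
# then pick a join strategy from the estimated sizes.

def estimate_filter_selectivity(ndv, min_val, max_val, predicate_val):
    """Selectivity of `col = predicate_val`, assuming a uniform distribution."""
    if ndv == 0 or predicate_val < min_val or predicate_val > max_val:
        return 0.0
    return 1.0 / ndv

def estimate_output_rows(table_rows, selectivity):
    """Estimated cardinality after applying a filter."""
    return int(table_rows * selectivity)

def choose_join_strategy(left_bytes, right_bytes, broadcast_threshold=10 * 1024 * 1024):
    """Broadcast the smaller side if it fits under the threshold; otherwise shuffle."""
    smaller = min(left_bytes, right_bytes)
    if smaller <= broadcast_threshold:
        side = "left" if left_bytes <= right_bytes else "right"
        return ("broadcast-hash-join", side)
    return ("shuffled-hash-join", None)

# A table of 1M rows with 50,000 distinct values: `col = k` keeps ~20 rows.
rows = estimate_output_rows(1_000_000, estimate_filter_selectivity(50_000, 0, 10**6, 42))
print(rows)                                    # 20
print(choose_join_strategy(5 * 1024 * 1024, 2 * 1024**3))
```

Real optimizers refine the uniform assumption with the equal-width or equal-height histograms mentioned above, which matters precisely when data is skewed.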
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i... (Flink Forward)
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incrementally building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions we face unique challenges when operating at scale, such as supporting multiple applications with tens of thousands of transactions per second and several dependencies, including near-real-time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near real time in our solution. In this talk, we will cover the challenges we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences and delight our customers while preserving their privacy.
by
Aansh Shah
Storing State Forever: Why It Can Be Good For Your Analytics (Yaroslav Tkachenko)
State is an essential part of modern streaming pipelines: it enables a variety of foundational capabilities like windowing, aggregation, and enrichment. Usually, though, state is either transient, so we only keep it until the window is closed, or it's fairly small and doesn't grow much. But what if we treat state differently? The keyed state in Flink can be scaled vertically and horizontally, and it's reliable and fault-tolerant... so is scaling a stateful Flink application really that different from scaling any data store like Kafka or MySQL?
At Shopify, we've worked on a massive analytical data pipeline that needs to support complex streaming joins and correctly handle arbitrarily late-arriving data. We came up with the idea of never clearing state and supporting joins this way. We built a successful proof of concept, ingested all historical transactional Shopify data, and ended up storing more than 10 TB of Flink state. In the end, it allowed us to achieve 100% data correctness.
Iceberg: A modern table format for big data (Strata NY 2018), Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies a portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
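The first three properties in the list above can be made concrete with a minimal, purely illustrative Python model. The class and method names here are invented for this sketch; real Iceberg metadata uses manifest files and manifest lists rather than in-memory dictionaries.

```python
# Toy model of snapshot-based table metadata: query planning reads a
# snapshot's file list instead of listing directories, and a commit
# atomically replaces the current-snapshot pointer.

class Table:
    def __init__(self):
        self.snapshots = {0: []}      # snapshot_id -> list of data file paths
        self.current = 0              # pointer swapped atomically on commit

    def plan_scan(self):
        # Readers see one immutable snapshot: snapshot isolation, no locks,
        # no directory listings.
        return list(self.snapshots[self.current])

    def commit(self, added, removed=()):
        # Build a new file list and swap the pointer in one step, so files
        # appear to be added/removed/replaced atomically.
        gone = set(removed)
        base = [f for f in self.snapshots[self.current] if f not in gone]
        new_id = self.current + 1
        self.snapshots[new_id] = base + list(added)
        self.current = new_id

t = Table()
t.commit(added=["data/f1.parquet", "data/f2.parquet"])
reader_view = t.plan_scan()
t.commit(added=["data/f3.parquet"], removed=["data/f1.parquet"])
print(reader_view)      # old snapshot unaffected by the concurrent commit
print(t.plan_scan())
```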
Where is my bottleneck? Performance troubleshooting in Flink (Flink Forward)
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Garbage First Garbage Collector (G1 GC): Current and Future Adaptability and ... (Monica Beckwith)
G1 GC Presentation @ JavaOne 2013
Sneak a peek under the hood of the latest and coolest garbage collector, Garbage-First!
Dive deep into G1's adaptability and ergonomics
Discuss the future of G1's adaptability
Apache Spark's Built-in File Sources in Depth (Databricks)
In the Spark 3.0 release, all the built-in file source connectors (including Parquet, ORC, JSON, Avro, CSV, and text) are re-implemented using the new Data Source API V2. We will give a technical overview of how Spark reads and writes these file formats based on user-specified data layouts. The talk will also explain the differences between Hive SerDe and native connectors, and share experiences on how to tune the connectors and choose the best data layouts for achieving the best performance.
Near Real-Time Data Warehousing with Apache Spark and Delta Lake (Databricks)
Timely data in a data warehouse is a challenge many of us face, often with there being no straightforward solution.
Using a combination of batch and streaming data pipelines, you can leverage the Delta Lake format to provide an enterprise data warehouse at near-real-time frequency. Delta Lake eases the ETL workload by enabling ACID transactions in a warehousing environment. Coupling this with Structured Streaming, you can achieve a low-latency data warehouse. In this talk, we’ll discuss how to use Delta Lake to improve the latency of ingestion and storage of your data warehouse tables. We’ll also talk about how you can use Spark Streaming to build the aggregations and tables that drive your data warehouse.
3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi... (HostedbyConfluent)
The presentation highlights the main technical challenges Radicalbit faced while building a real-time serving engine for streaming machine learning algorithms. The talk describes how Kafka has been used to fasten two ML technologies together: River, an open-source suite of streaming machine learning algorithms, and Seldon Core, a DevOps-driven MLOps platform.
In particular, the talk focuses on how Kafka has been used to (1) build a dynamic model-serving framework thanks to Kafka Streams joins and the broadcasting pattern, (2) implement a user-given feedback topic from which online models can learn while they generate predictions, and (3) design a models' prediction bus, a bidirectional Kafka topic through which predictions flow at tremendous scale. The prediction bus enabled the Seldon Core Kubernetes deployment to communicate with Kafka Streams, and the talk concludes by explaining how this unlocked unprecedented performance.
Best Practices for Managing MongoDB with Ops Manager (MongoDB)
Speaker: Arkadiusz Borucki, Mongo Database Administrator, Amadeus Data Processing GmbH
Speaker: Paul Hubert, Amadeus
Level: 300 (Advanced)
Track: Operations
Amadeus has developed its industrialization and automation around Ops Manager. We manage a large environment with 50 clusters, some of which run more than 100 shards. Ops Manager is leveraged to drive 100% automation for new cluster deployments, upgrades, and backups, together with deployment tools like Ansible and Puppet. The solution is compliant with our strict monitoring, security, and DR requirements.
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era, starting all the way back when Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Tame the small files problem and optimize data layout for streaming ingestion... (Flink Forward)
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: (1) a small-files problem that can hurt read performance, and (2) poor data clustering that can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of concurrent files that every task writes, and it can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
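The bin-packing idea in the abstract above can be illustrated with a small sketch. This is purely a toy: the traffic numbers, the capacity, and the function name are invented, and the real shuffling stage operates on Flink records rather than static weights.

```python
# First-fit-decreasing bin packing: group skewed keys into capacity-bounded
# bins so each writer task handles one bin, producing fewer, larger files.

def pack_keys(key_weights, bin_capacity):
    """Return bins as lists of keys, packed by descending weight."""
    bins, loads = [], []
    for key, weight in sorted(key_weights.items(), key=lambda kv: -kv[1]):
        placed = False
        for i in range(len(bins)):
            if loads[i] + weight <= bin_capacity:   # fits in an existing bin
                bins[i].append(key)
                loads[i] += weight
                placed = True
                break
        if not placed:                              # open a new bin
            bins.append([key])
            loads.append(weight)
    return bins

# 5 event types with skewed traffic, capacity 100 -> 2 writer groups, not 5.
traffic = {"clicks": 90, "views": 60, "likes": 20, "shares": 15, "installs": 5}
print(pack_keys(traffic, 100))
```

Range partitioning, the other grouping strategy mentioned in the talk, would instead assign contiguous key ranges to writers to improve data clustering for pruning.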
Traditionally, database systems were optimized either for OLAP or for OLTP workloads. Mainstream DBMSes like Postgres, MySQL, etc. are mostly used for OLTP, while Greenplum, Vertica, ClickHouse, Spark SQL, etc. are oriented toward analytic queries. But right now many companies do not want to maintain two different data stores for OLAP and OLTP, and need to perform analytic queries on the most recent data. I want to discuss which features should be added to Postgres to efficiently handle HTAP workloads.
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark, it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins, and we show significant improvements for most TPC-DS queries.
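The runtime mechanism can be sketched in a few lines of Python. This is an illustration under invented names and data, not Spark internals; in Spark, the filtered dimension keys come from the broadcast hash-join build side rather than a plain set.

```python
# Sketch of dynamic partition pruning: filter the dimension table first,
# collect its join keys, and scan only the fact-table partitions whose
# partition key survives that filter.

fact_partitions = {                 # partition key (date_id) -> fact rows
    1: [("sale", 10)],
    2: [("sale", 20)],
    3: [("sale", 30)],
}
dim_dates = [                       # (date_id, year)
    (1, 2019),
    (2, 2020),
    (3, 2020),
]

def pruned_scan(fact, dim, dim_filter):
    keys = {d_id for d_id, year in dim if dim_filter(year)}   # "broadcast" result
    scanned = [p for p in fact if p in keys]                  # pruned partition list
    rows = [r for p in scanned for r in fact[p]]
    return scanned, rows

scanned, rows = pruned_scan(fact_partitions, dim_dates, lambda y: y == 2020)
print(scanned)   # only partitions 2 and 3 are read
print(rows)
```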
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
This talk presents three programming situations where typeclasses and generics are not adequate: evolving serialization protocols, data generation, and modular applications. A library, registry, can be used to help with these three situations by giving us the means to wire and rewire code at will.
1. Transactions: strong consistency in the absence of partitions is guaranteed by runtime transactions; mnesia supports several transaction types:
a) lock-free, transaction-free dirty writes: one asynchronous phase;
b) asynchronous transactions with locks: one synchronous lock phase, then a transaction with one synchronous phase and one asynchronous phase;
c) synchronous transactions with locks: one synchronous lock phase, then a two-phase synchronous transaction;
d) majority transactions with locks: one synchronous lock phase, then a transaction with two synchronous phases and one asynchronous phase;
e) schema transactions with locks: one synchronous lock phase, then a three-phase synchronous transaction; this is a majority transaction that additionally carries schema operations;
2. Recovery: eventual consistency in the presence of partitions is guaranteed by recovery at restart; when mnesia restarts, it performs the following distributed negotiation:
a) node discovery;
b) node protocol version negotiation;
c) node schema merging;
d) node transaction decision merging;
i. if the remote node's decision is abort and the local decision is commit, there is a conflict: {inconsistent_database, bad_decision, Node} is reported and the local decision is changed to abort;
ii. if the remote decision is commit and the local decision is abort, the local decision remains abort; the remote node will adjust its decision and be notified;
iii. if the remote decision is unclear and the local decision is not, the local decision wins and the remote node adjusts its decision;
iv. if both the remote and local decisions are unclear, wait until some node that knows the outcome starts up, and adopt its result;
v. if every node's decision is unclear, the outcome is unclear;
vi. transaction decisions do not actually affect the table data itself;
e) node table data merging:
i. if the local node is the master node for a table, it loads the table data from disk;
ii. if the local node has a local table, it loads that table's data from disk;
iii. if a remote node is alive, table data is pulled from the remote node;
iv. if no remote node is alive and the local node was the last one to shut down, the local node loads the table data from disk;
v. if no remote node is alive and the local node was not the last one to shut down, it waits for another remote node to start up and load the table data, then pulls the data from that node; until the remote node has loaded the data, the table is inaccessible;
vi. once table data has been loaded, it is not pulled from remote nodes again;
vii. from the cluster's point of view:
1. if another node restarts and initiates a new distributed negotiation, the local node adds it to the cluster topology view;
2. if a node in the cluster goes down (shuts down or becomes partitioned), the local node removes it from the cluster topology view;
3. no distributed negotiation takes place when a partition heals; nodes in other partitions cannot join the cluster topology view, and the partitions remain separate;
3. Inconsistency detection: by monitoring, at runtime and at restart, the up/down status of remote nodes and their transaction decisions, mnesia detects whether a network partition has ever occurred; if one has, there is a potential partition inconsistency, and an inconsistent_database event is reported to the application:
a) at runtime, the up/down history of remote nodes is monitored; if both sides believe the other has been down, then when the remote node comes back up, {inconsistent_database, running_partitioned_network,
    end,
    add_remote_decisions(Node, Tail, State);
add_remote_decisions(_Node, [], State) ->
    State.
add_remote_decision(Node, NewD, State) ->
    Tid = NewD#decision.tid,
    OldD = decision(Tid),
    %% Merge decisions according to the merge policy. In the only conflict
    %% case, where the receiving node committed the transaction but the
    %% sending node aborted it, the receiving node also chooses to abort;
    %% the transaction's state is then reconstructed from checkpoints and
    %% the redo log.
    D = merge_decisions(Node, OldD, NewD),
    %% Log the merge result.
    do_log_decision(D, false, undefined),
    Outcome = D#decision.outcome,
    if
        OldD == no_decision -> ignore;
        Outcome == unclear -> ignore;
        true ->
            case lists:member(node(), NewD#decision.disc_nodes) or
                 lists:member(node(), NewD#decision.ram_nodes) of
                true ->
                    %% Tell the other nodes about this node's merged decision.
                    tell_im_certain([Node], D);
                false -> ignore
            end
    end,
    case State#state.unclear_decision of
        U when U#decision.tid == Tid ->
            WaitFor = State#state.unclear_waitfor -- [Node],
            if
                Outcome == unclear, WaitFor == [] ->
                    %% Everybody are uncertain, lets abort
                    %% All participants of the pending transaction have been
                    %% asked and none of them can supply a commit outcome,
                    %% so decide to abort the transaction.
                    NewOutcome = aborted,
                    CertainD = D#decision{outcome = NewOutcome,
                                          disc_nodes = [],
                                          ram_nodes = []},
                    tell_im_certain(D#decision.disc_nodes, CertainD),
                    tell_im_certain(D#decision.ram_nodes, CertainD),
                    do_log_decision(CertainD, false, undefined),
                    verbose("Decided to abort transaction ~p "
                            "since everybody are uncertain ~p~n",
                            [Tid, CertainD]),
                    gen_server:reply(State#state.unclear_pid, {ok, NewOutcome}),
                    State#state{unclear_pid = undefined,
                                unclear_decision = undefined,
                                unclear_waitfor = undefined};
                Outcome /= unclear ->
                    %% The sending node knows the transaction outcome;
                    %% announce it.
                    verbose("~p told us that transaction ~p was ~p~n",
                            [Node, Tid, Outcome]),
                    gen_server:reply(State#state.unclear_pid, {ok, Outcome}),
                    State#state{unclear_pid = undefined,
                                unclear_decision = undefined,
                                unclear_waitfor = undefined};
                Outcome == unclear ->
                    %% The sending node does not know the outcome either;
                    %% keep waiting.
                    State#state{unclear_waitfor = WaitFor}
            end;
        _ ->
            State
    end.
The merge policy:
merge_decisions(Node, D, NewD0) ->
NewD = filter_aborted(NewD0),
if
D == no_decision, node() /= Node ->
%% We did not know anything about this txn
NewD#decision{disc_nodes = []};
D == no_decision ->
NewD;
is_record(D, decision) ->
DiscNs = D#decision.disc_nodes -- ([node(), Node]),
OldD = filter_aborted(D#decision{disc_nodes = DiscNs}),
if
51. OldD#decision.outcome == unclear,
NewD#decision.outcome == unclear ->
D;
OldD#decision.outcome == NewD#decision.outcome ->
%% We have come to the same decision
OldD;
OldD#decision.outcome == committed,
NewD#decision.outcome == aborted ->
%% The only real conflict between the decision's sender and receiver:
%% the receiver committed the transaction while the sender aborted it.
%% In that case we still choose to abort.
Msg = {inconsistent_database, bad_decision, Node},
mnesia_lib:report_system_event(Msg),
OldD#decision{outcome = aborted};
OldD#decision.outcome == aborted -> OldD#decision{outcome = aborted};
NewD#decision.outcome == aborted -> OldD#decision{outcome = aborted};
OldD#decision.outcome == committed,
NewD#decision.outcome == unclear -> OldD#decision{outcome = committed};
OldD#decision.outcome == unclear,
NewD#decision.outcome == committed -> OldD#decision{outcome = committed}
end
end.
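The outcome-combination clauses above form a small lattice: `aborted` dominates everything, `committed` dominates `unclear`, and the committed-versus-aborted conflict resolves to `aborted` while raising an `inconsistent_database` system event. A minimal Python model of just this rule table (the function name and the boolean flag are ours, not mnesia's):

```python
def merge_outcome(old, new):
    """Combine two decision outcomes following merge_decisions' clauses.

    old, new: "committed", "aborted" or "unclear".
    Returns (outcome, inconsistent), where inconsistent marks the
    committed-vs-aborted conflict that mnesia reports as a system event.
    """
    if old == new:
        # Same decision on both sides (including unclear/unclear).
        return old, False
    if old == "committed" and new == "aborted":
        # The only true conflict: we committed, the sender aborted.
        # mnesia still chooses abort and reports inconsistent_database.
        return "aborted", True
    if "aborted" in (old, new):
        # Any other abort wins without being a conflict.
        return "aborted", False
    # The remaining cases pair "committed" with "unclear".
    return "committed", False
```

Note that `aborted`/`committed` (in that order) is not treated as a conflict: an abort on the receiving side simply wins, matching the `OldD#decision.outcome == aborted` clause.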
2. Node discovery and cluster traversal
mnesia_controller.erl
merge_schema() ->
AllNodes = mnesia_lib:all_nodes(),
%% Try to merge the schema; after merging, notify every former cluster
%% node so that data transfer with this node can begin
case try_merge_schema(AllNodes, [node()], fun default_merge/1) of
ok ->
%% Once the schema merge succeeds, proceed to data merging
schema_is_merged();
{aborted, {throw, Str}} when is_list(Str) ->
fatal("Failed to merge schema: ~s~n", [Str]);
Else ->
fatal("Failed to merge schema: ~p~n", [Else])
end.
try_merge_schema(Nodes, Told0, UserFun) ->
%% Start the cluster traversal by launching a schema-merge transaction
case mnesia_schema:merge_schema(UserFun) of
{atomic, not_merged} ->
%% No more nodes that we need to merge the schema with
%% Ensure we have told everybody that we are running
case val({current,db_nodes}) -- mnesia_lib:uniq(Told0) of
[] -> ok;
Tell ->
im_running(Tell, [node()]),
ok
end;
{atomic, {merged, OldFriends, NewFriends}} ->
%% Check if new nodes has been added to the schema
Diff = mnesia_lib:all_nodes() -- [node() | Nodes],
mnesia_recover:connect_nodes(Diff),
%% Tell everybody to adopt orphan tables
%% Tell all cluster nodes that this node is up and request data merging
im_running(OldFriends, NewFriends),
im_running(NewFriends, OldFriends),
Told = case lists:member(node(), NewFriends) of
true -> Told0 ++ OldFriends;
false -> Told0 ++ NewFriends
end,
try_merge_schema(Nodes, Told, UserFun);
{atomic, {"Cannot get cstructs", Node, Reason}} ->
dbg_out("Cannot get cstructs, Node ~p ~p~n", [Node, Reason]),
timer:sleep(300), % Avoid a endless loop look alike
try_merge_schema(Nodes, Told0, UserFun);
{aborted, {shutdown, _}} -> %% One of the nodes is going down
timer:sleep(300), % Avoid a endless loop look alike
try_merge_schema(Nodes, Told0, UserFun);
Other ->
Other
end.
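`try_merge_schema` is effectively a breadth-first traversal of the cluster: each successful merge transaction may reveal new friends, and the loop repeats until a transaction returns `not_merged`. A schematic Python version of just the loop structure, with the merge transaction stubbed out and the `Told` bookkeeping simplified (all names are ours):

```python
def traverse_cluster(merge_once, start_node):
    """Repeatedly merge schemas until no unmerged node remains.

    merge_once(known) returns either ("merged", new_friends) or
    "not_merged"; it models one mnesia_schema:merge_schema/1 transaction.
    Returns (known_nodes, told_nodes).
    """
    known = {start_node}   # nodes whose schema we have merged with
    told = [start_node]    # nodes already told that we are running
    while True:
        result = merge_once(known)
        if result == "not_merged":
            # No more nodes to merge with: traversal finished.
            return known, told
        _, new_friends = result
        known |= set(new_friends)   # newly discovered cluster nodes
        told += new_friends         # im_running has been sent to them
```

The real code also retries with a short sleep on transient failures (`"Cannot get cstructs"`, a peer shutting down), which the sketch omits.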
mnesia_schema.erl
merge_schema() ->
schema_transaction(fun() -> do_merge_schema([]) end).
merge_schema(UserFun) ->
schema_transaction(fun() -> UserFun(fun(Arg) -> do_merge_schema(Arg) end) end).
As we can see, the merge_schema process also runs inside an mnesia metadata
(schema) transaction, whose main operations include:
{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}
{op, merge_schema, CstructList}
This process negotiates the schema with the transactional nodes in the cluster and checks whether the schemas are compatible.
do_merge_schema(LockTabs0) ->
%% Lock the schema table
{_Mod, Tid, Ts} = get_tid_ts_and_lock(schema, write),
LockTabs = [{T, tab_to_nodes(T)} || T <- LockTabs0],
[get_tid_ts_and_lock(T,write) || {T,_} <- LockTabs],
Connected = val(recover_nodes),
Running = val({current, db_nodes}),
Store = Ts#tidstore.store,
%% Verify that all nodes are locked that might not be the
%% case, if this trans where queued when new nodes where added.
case Running -- ets:lookup_element(Store, nodes, 2) of
[] -> ok; %% All known nodes are locked
Miss -> %% Abort! We don't want the sideeffects below to be executed
mnesia:abort({bad_commit, {missing_lock, Miss}})
end,
%% Connected is the set of nodes this node has connected to, typically the
%% nodes in the current cluster with a compatible communication protocol;
%% Running is this node's current db_nodes, typically the cluster nodes
%% consistent with this node.
case Connected -- Running of
%% For nodes that are connected but have not yet exchanged decisions, we
%% first negotiate the communication protocol and then the decisions. This
%% is essentially a node-discovery (traversal) process over the global
%% topology, initiated by some node.
[Node | _] = OtherNodes ->
%% Time for a schema merging party!
mnesia_locker:wlock_no_exist(Tid, Store, schema, [Node]),
[mnesia_locker:wlock_no_exist(
Tid, Store, T, mnesia_lib:intersect(Ns, OtherNodes))
|| {T,Ns} <- LockTabs],
%% Fetch from remote node Node the cstructs of the tables it owns,
%% together with its db_nodes, RemoteRunning1
case fetch_cstructs(Node) of
{cstructs, Cstructs, RemoteRunning1} ->
…
ignore; %% from do_merge_schema
true ->
%% If a node has restarted it may still linger in db_nodes,
%% but have been removed from recover_nodes
Current = mnesia_lib:intersect(val({current,db_nodes}), [node()|val(recover_nodes)]),
NewNodes = mnesia_lib:uniq(Running++RemoteRunning) -- Current,
mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}),
announce_im_running(NewNodes, SchemaCs)
end,
{false, optional};
Here we can see that during the prepare phase of announce_im_running, this node
negotiates with remote, not-yet-connected nodes; once the negotiation succeeds,
those nodes join this node's transactional cluster.
Conversely, if this schema operation aborts, mnesia_tm performs the undo:
mnesia_tm.erl
commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]),
case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of
{Modified, C = #commit{}, DumperMode} ->
%% If we can not find any local unclear decision
%% we should presume abort at startup recovery
case lists:member(node(), DiscNs) of
false ->
ignore;
true ->
case Modified of
false -> mnesia_log:log(Bin);
true -> mnesia_log:log(C)
end
end,
?eval_debug_fun({?MODULE, commit_participant, vote_yes},
[{tid, Tid}]),
reply(Coord, {vote_yes, Tid, self()}),
receive
{Tid, pre_commit} ->
…
receive
{Tid, committed} ->
…
{Tid, {do_abort, _Reason}} ->