An Overview of Mnesia Split-Brain Issues
1. Table of Contents
1. Symptoms and Causes
2. How Mnesia Works
3. Common Questions and Caveats
4. Source Code Analysis
   1. How mnesia:create_schema/1 works
      1. Overall flow
      2. First half: the work done by mnesia:create_schema/1
      3. Second half: the work done by mnesia:start/0
   4. How mnesia:change_table_majority/2 works
      1. The calling interface
      2. The transaction operation
      3. The schema transaction commit interface
      4. The schema transaction protocol
      5. Remote transaction manager: the phase-one prepare response
      6. Remote transaction participant: the phase-two precommit response
      7. Requesting node: receiving the phase-two precommit acknowledgements
      8. Remote transaction participant: the phase-three commit response
      9. Local commit during the phase-three commit
   5. Majority transaction handling
   6. Recovery
      1. Protocol version check + decision announcement and merging
      2. Node discovery and cluster traversal
      3. Schema merging
      4. Data merge, part 1: loading tables from remote nodes
      5. Data merge, part 2: loading tables from local disk
      6. Data merge, part 3: table loading completes
   7. Partition Detection
      1. Synchronous detection during locking
      2. Synchronous detection during transactions
      3. Asynchronous detection on node down
      4. Asynchronous detection on node up
   8. Miscellaneous
The code analyzed in this document is from Erlang/OTP R15B03.
1. Symptoms and Causes
Symptom: after a network partition, different data can be written to each partition, leaving the partitions in inconsistent states. When the partition heals, Mnesia still presents the inconsistent state. If any partition is restarted, the restarted side pulls its data from the surviving side and its own earlier data is lost.
Cause: distributed systems are governed by the CAP theorem (a system that tolerates network partitions cannot simultaneously satisfy both availability and consistency). Mnesia deals with partitions in the following ways:
1. Transactions: run-time transactions provide strong consistency while no partition exists. Mnesia supports several transaction types:
a) dirty writes, with no locks and no transaction: one asynchronous phase;
b) locked asynchronous transactions: one synchronous lock phase; the commit has one synchronous phase followed by one asynchronous phase;
c) locked synchronous transactions: one synchronous lock phase; the commit has two synchronous phases;
d) locked majority transactions: one synchronous lock phase; the commit has two synchronous phases and one asynchronous phase;
e) locked schema transactions: one synchronous lock phase; the commit has three synchronous phases; this is a majority transaction that additionally carries schema operations.
2. Recovery: recovery at restart provides eventual consistency once a partition has occurred. On restart, Mnesia performs the following distributed negotiation:
a) node discovery;
b) protocol version negotiation;
c) schema merging;
d) merging of transaction decisions:
   i. if the remote node's decision is abort and the local decision is commit, there is a conflict: {inconsistent_database, bad_decision, Node} is reported and the local decision is changed to abort;
   ii. if the remote decision is commit and the local decision is abort, the local decision stays abort; the remote node will change its decision and report the event;
   iii. if the remote decision is unclear and the local decision is not, the local decision wins and the remote node adopts it;
   iv. if both the remote and the local decisions are unclear, wait until some node that knows the outcome starts, and adopt its result;
   v. if every node's decision is unclear, the outcome remains unclear;
   vi. the transaction decision itself does not modify the actual table data;
e) merging of table data:
   i. if this node is a master node, it loads the table from its own disk;
   ii. if this node has local_content tables, it loads those from its own disk;
   iii. if a remote replica node is alive, the table is copied from that remote node;
   iv. if no remote replica is alive and this node was the last one to shut down, it loads the table from its own disk;
   v. if no remote replica is alive and this node was not the last one to shut down, it waits for another replica node to start and load the table, then copies it from there; until that happens the table is not accessible;
   vi. once a table has been loaded, it is not copied from a remote node again;
   vii. from the cluster's point of view:
      1. if another restarting node initiates a new negotiation, this node adds it to its cluster topology view;
      2. if a cluster node goes down (shut down or partitioned away), this node removes it from its topology view;
      3. when a partition heals, no negotiation takes place, so nodes from the other partition are not re-added to the topology view and the partitions remain separate;
3. Inconsistency detection: by monitoring remote nodes' up/down status and their transaction decisions, both at run time and at restart, Mnesia detects whether a network partition may have occurred. If it has, a potential inconsistency exists and an inconsistent_database event is sent to the application (a subscription sketch follows this list):
a) at run time, up/down history is monitored; if both sides have seen the other down, then when the remote node comes back up the event {inconsistent_database, running_partitioned_network, Node} is reported;
b) at restart, up/down history is monitored; if both sides have seen the other down, then when the remote node comes back up the event {inconsistent_database, starting_partitioned_network, Node} is reported;
c) at run time and at restart, transaction decisions are exchanged with remote nodes; if the remote node aborted a transaction that this node committed, the event {inconsistent_database, bad_decision, Node} is reported.
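The reports listed above are delivered as Mnesia system events; a minimal subscriber sketch (the watcher process and the logging are illustrative, not part of Mnesia itself):

%% Subscribe to Mnesia system events and log possible split-brain reports.
start_partition_watcher() ->
    spawn(fun() ->
                  {ok, _Node} = mnesia:subscribe(system),
                  watch_loop()
          end).

watch_loop() ->
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            %% Context is running_partitioned_network,
            %% starting_partitioned_network or bad_decision
            error_logger:error_msg("possible split brain (~p) with ~p~n",
                                   [Context, Node]),
            watch_loop();
        {mnesia_system_event, _Other} ->
            watch_loop()
    end.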
3. Common Questions and Caveats
In the questions below only majority transactions are considered, because they are more complete than the synchronous and asynchronous kinds while not involving schema operations.
fail_safe state: after a network partition, the state in which the minority partition can no longer be written to.
Common questions:
1. After a partition, this node ends up in the minority partition. If this node is not restarted after the partition heals, does it stay in the fail_safe state forever?
If other nodes keep starting up, negotiating with this node and joining its cluster until that cluster becomes the majority, the cluster becomes writable again;
if no other node ever starts, this node stays in the fail_safe state.
2. After a brief network interruption, a write is made in the majority partition that cannot reach the minority partition; after the partition heals, a write is made in the minority partition. How does the minority end up in the fail_safe state?
Mnesia relies on the Erlang VM to detect node down events. When the majority writes, its VM detects the minority nodes as down, and the minority's VM likewise sees the majority nodes as down. Since both sides have seen the other side down, the majority stays writable while the minority enters the fail_safe state.
3. For a cluster A, B, C, a partition separates A from B and C; data is written on B and C; then the partition heals. What happens if A is restarted? What happens if B and C are restarted?
Experimentally:
a) if A is restarted, the records written on B and C are correctly visible on A; this relies on A's negotiation at startup, during which A requests the table data from B and C;
b) if B and C are restarted, the records previously written on them are no longer visible; this relies on B's and C's negotiation at startup, during which they request the table data from A.
Caveats:
1. When partitioned, Mnesia is eventually consistent, not strongly consistent. To get strong consistency you can designate a master node to arbitrate the final data, but this introduces a single point of failure (see the sketch after this list);
2. When you subscribe to Mnesia's system events (including inconsistent_database events), Mnesia is already running, so some events may already have been emitted and can be missed;
3. Mnesia event subscriptions are not persistent; you must resubscribe after Mnesia restarts;
4. A majority transaction commits with two synchronous phases and one asynchronous phase; during commit each participating node also has to spawn a helper process, and together with the extra ets table and the synchronous lock phase this can further reduce performance;
5. The majority property does not constrain the recovery process, and recovery prefers a live remote node's copy as the basis for restoring the local table;
6. Mnesia's inconsistent_database check and report is a rather strong condition and can produce false positives;
7. Once split brain has been detected after a partition ends, it is best to raise an alarm and let an operator resolve it rather than resolving it automatically.
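Caveats 1 and 7 suggest letting an operator pick a winner. A minimal sketch of one way to apply that decision on the losing node, assuming the operator has chosen SurvivorNode as authoritative; heal_from/1 is illustrative, not an Mnesia API:

%% Run on the node whose partition data should be discarded.
heal_from(SurvivorNode) ->
    ok = mnesia:set_master_nodes([SurvivorNode]),   % survivor becomes the load source
    stopped = mnesia:stop(),
    ok = mnesia:start(),                            % recovery now loads tables from the survivor
    ok = mnesia:wait_for_tables(mnesia:system_info(tables), 30000).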
4. Source Code Analysis
The topics covered:
1. A disc copy of an Mnesia table requires a disc copy of the schema as well, so we first look at how mnesia:create_schema/1 works;
2. Majority transactions are used as the running example, so we must look at how mnesia:change_table_majority/2 works; since that is a schema transaction, it also gives a fuller, more detailed picture of majority transactions;
3. Majority transaction handling is then explained as a simplified form of the schema transaction model;
4. The recovery part analyses the main work done at Mnesia startup, the distributed negotiation, and disc table loading;
5. The partition detection part analyses how Mnesia detects the various inconsistent_database events.
1. How mnesia:create_schema/1 works
1. Overall flow
Installing a schema must be done while Mnesia is stopped; Mnesia is started afterwards.
Adding a schema is essentially a two-phase commit (a usage sketch of the API follows this list):
On the node initiating the schema change:
1. ask each participating node whether it already has a schema copy
2. take the global lock {mnesia_table_lock, schema}
3. create a mnesia_fallback process on every participating node
4. phase one: broadcast {start, Header, Schema2} to every node's mnesia_fallback process, telling it to save a backup of the newly generated schema file
5. phase two: broadcast swap to every node's mnesia_fallback process, telling it to finish the commit and create the real "FALLBACK.BUP" file
6. finally broadcast stop to every node's mnesia_fallback process, completing the change
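For orientation, a minimal usage sketch of the API analyzed in this chapter; the table name, attributes and timeout are illustrative only, not taken from the analyzed source:

%% Run while Mnesia is stopped on all participating nodes.
setup(Nodes) ->
    ok = mnesia:create_schema(Nodes),                   % installs FALLBACK.BUP via the 2PC above
    {_, []} = rpc:multicall(Nodes, mnesia, start, []),  % mnesia:start/0 turns it into schema.DAT
    {atomic, ok} = mnesia:create_table(account,
                                       [{disc_copies, Nodes},
                                        {attributes, [id, balance]},
                                        {majority, true}]),
    ok = mnesia:wait_for_tables([account], 10000).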
2. First half: the work done by mnesia:create_schema/1
mnesia.erl
create_schema(Ns) ->
mnesia_bup:create_schema(Ns).
mnesia_bup.erl
create_schema([]) ->
create_schema([node()]);
create_schema(Ns) when is_list(Ns) ->
case is_set(Ns) of
true ->
create_schema(Ns, mnesia_schema:ensure_no_schema(Ns));
false ->
{error, {combine_error, Ns}}
end;
create_schema(Ns) ->
{error, {badarg, Ns}}.
mnesia_schema.erl
ensure_no_schema([H|T]) when is_atom(H) ->
case rpc:call(H, ?MODULE, remote_read_schema, []) of
{badrpc, Reason} ->
{H, {"All nodes not running", H, Reason}};
{ok,Source, _} when Source /= default ->
{H, {already_exists, H}};
_ ->
ensure_no_schema(T)
end;
ensure_no_schema([H|_]) ->
{error,{badarg, H}};
ensure_no_schema([]) ->
ok.
remote_read_schema() ->
case mnesia_lib:ensure_loaded(?APPLICATION) of
ok ->
case mnesia_monitor:get_env(schema_location) of
opt_disc ->
read_schema(false);
_ ->
read_schema(false)
end;
{error, Reason} ->
{error, Reason}
end.
All other nodes are queried to check that they are running and whether they already have an Mnesia schema. The check succeeds only if every node that is to take part in the new schema is running and none of them has a schema copy yet.
Back in mnesia_bup.erl:
mnesia_bup.erl
create_schema(Ns, ok) ->
case mnesia_lib:ensure_loaded(?APPLICATION) of
ok ->
case mnesia_monitor:get_env(schema_location) of
ram ->
{error, {has_no_disc, node()}};
_ ->
case mnesia_schema:opt_create_dir(true, mnesia_lib:dir()) of
{error, What} ->
{error, What};
ok ->
Mod = mnesia_backup,
Str = mk_str(),
File = mnesia_lib:dir(Str),
file:delete(File),
case catch make_initial_backup(Ns, File, Mod) of
{ok, _Res} ->
case do_install_fallback(File, Mod) of
ok ->
file:delete(File),
ok;
{error, Reason} ->
{error, Reason}
end;
{error, Reason} ->
{error, Reason}
end
end
end;
{error, Reason} ->
{error, Reason}
end;
create_schema(_Ns, {error, Reason}) ->
{error, Reason};
create_schema(_Ns, Reason) ->
{error, Reason}.
mnesia_bup:make_initial_backup creates a description file for the new schema on the local node, and mnesia_bup:do_install_fallback then pushes that description file through the restore machinery to change the schema:
make_initial_backup(Ns, Opaque, Mod) ->
Orig = mnesia_schema:get_initial_schema(disc_copies, Ns),
Modded = proplists:delete(storage_properties, proplists:delete(majority, Orig)),
Schema = [{schema, schema, Modded}],
O2 = do_apply(Mod, open_write, [Opaque], Opaque),
O3 = do_apply(Mod, write, [O2, [mnesia_log:backup_log_header()]], O2),
O4 = do_apply(Mod, write, [O3, Schema], O3),
O5 = do_apply(Mod, commit_write, [O4], O4),
{ok, O5}.
This creates the local node's description file for the new schema. Note that the new schema's majority property is not included in the backup.
mnesia_schema.erl
get_initial_schema(SchemaStorage, Nodes) ->
Cs = #cstruct{name = schema,
record_name = schema,
attributes = [table, cstruct]},
Cs2 =
case SchemaStorage of
ram_copies -> Cs#cstruct{ram_copies = Nodes};
disc_copies -> Cs#cstruct{disc_copies = Nodes}
end,
cs2list(Cs2).
mnesia_bup.erl
do_install_fallback(Opaque, Mod) when is_atom(Mod) ->
do_install_fallback(Opaque, [{module, Mod}]);
do_install_fallback(Opaque, Args) when is_list(Args) ->
case check_fallback_args(Args, #fallback_args{opaque = Opaque}) of
{ok, FA} ->
do_install_fallback(FA);
{error, Reason} ->
{error, Reason}
end;
do_install_fallback(_Opaque, Args) ->
{error, {badarg, Args}}.
The installation arguments are checked and packed into a fallback_args record; the checks and construction happen in check_fallback_arg_type/2, after which the installation proceeds.
check_fallback_args([Arg | Tail], FA) ->
case catch check_fallback_arg_type(Arg, FA) of
{'EXIT', _Reason} ->
{error, {badarg, Arg}};
FA2 ->
check_fallback_args(Tail, FA2)
end;
check_fallback_args([], FA) ->
{ok, FA}.
check_fallback_arg_type(Arg, FA) ->
case Arg of
{scope, global} ->
FA#fallback_args{scope = global};
{scope, local} ->
FA#fallback_args{scope = local};
{module, Mod} ->
Mod2 = mnesia_monitor:do_check_type(backup_module, Mod),
FA#fallback_args{module = Mod2};
{mnesia_dir, Dir} ->
FA#fallback_args{mnesia_dir = Dir,
use_default_dir = false};
{keep_tables, Tabs} ->
atom_list(Tabs),
FA#fallback_args{keep_tables = Tabs};
{skip_tables, Tabs} ->
atom_list(Tabs),
FA#fallback_args{skip_tables = Tabs};
{default_op, keep_tables} ->
FA#fallback_args{default_op = keep_tables};
{default_op, skip_tables} ->
FA#fallback_args{default_op = skip_tables}
end.
Here the construction records the module argument as mnesia_backup and the opaque argument as the file name of the newly created schema file.
do_install_fallback(FA) ->
Pid = spawn_link(?MODULE, install_fallback_master, [self(), FA]),
Res =
receive
{'EXIT', Pid, Reason} -> % if appl has trapped exit
{error, {'EXIT', Reason}};
{Pid, Res2} ->
case Res2 of
{ok, _} ->
ok;
{error, Reason} ->
{error, {"Cannot install fallback", Reason}}
end
end,
Res.
install_fallback_master(ClientPid, FA) ->
process_flag(trap_exit, true),
State = {start, FA},
Opaque = FA#fallback_args.opaque,
Mod = FA#fallback_args.module,
Res = (catch iterate(Mod, fun restore_recs/4, Opaque, State)),
unlink(ClientPid),
ClientPid ! {self(), Res},
exit(shutdown).
The new schema file is iterated over and restored to the local node and to the whole cluster. Here Mod is mnesia_backup, Opaque is the new schema file's name, and State carries the fallback_args arguments, all at their default values.
Default definition of fallback_args:
-record(fallback_args, {opaque,
scope = global,
module = mnesia_monitor:get_env(backup_module),
use_default_dir = true,
mnesia_dir,
fallback_bup,
fallback_tmp,
skip_tables = [],
keep_tables = [],
default_op = keep_tables
}).
iterate(Mod, Fun, Opaque, Acc) ->
R = #restore{bup_module = Mod, bup_data = Opaque},
case catch read_schema_section(R) of
{error, Reason} ->
{error, Reason};
{R2, {Header, Schema, Rest}} ->
case catch iter(R2, Header, Schema, Fun, Acc, Rest) of
{ok, R3, Res} ->
catch safe_apply(R3, close_read, [R3#restore.bup_data]),
{ok, Res};
{error, Reason} ->
catch safe_apply(R2, close_read, [R2#restore.bup_data]),
{error, Reason};
{'EXIT', Pid, Reason} ->
catch safe_apply(R2, close_read, [R2#restore.bup_data]),
{error, {'EXIT', Pid, Reason}};
{'EXIT', Reason} ->
catch safe_apply(R2, close_read, [R2#restore.bup_data]),
{error, {'EXIT', Reason}}
end
end.
iter(R, Header, Schema, Fun, Acc, []) ->
case safe_apply(R, read, [R#restore.bup_data]) of
{R2, []} ->
Res = Fun([], Header, Schema, Acc),
{ok, R2, Res};
{R2, BupItems} ->
iter(R2, Header, Schema, Fun, Acc, BupItems)
end;
iter(R, Header, Schema, Fun, Acc, BupItems) ->
Acc2 = Fun(BupItems, Header, Schema, Acc),
iter(R, Header, Schema, Fun, Acc2, []).
read_schema_section reads the contents of the new schema file, obtains the file header, assembles the schema structure and passes it to the callback, which here is mnesia_bup:restore_recs/4:
restore_recs(Recs, Header, Schema, {start, FA}) ->
%% No records in backup
Schema2 = convert_schema(Header#log_header.log_version, Schema),
CreateList = lookup_schema(schema, Schema2),
case catch mnesia_schema:list2cs(CreateList) of
{'EXIT', Reason} ->
throw({error, {"Bad schema in restore_recs", Reason}});
Cs ->
Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies),
global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity),
Args = [self(), FA],
Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns],
send_fallback(Pids, {start, Header, Schema2}),
Res = restore_recs(Recs, Header, Schema2, Pids),
global:del_lock({{mnesia_table_lock, schema}, self()}, Ns),
Res
end;
A typical schema structure looks like this:
[{schema,schema,
[{name,schema},
{type,set},
{ram_copies,[]},
{disc_copies,['rds_la_dev@10.232.64.77']},
{disc_only_copies,[]},
{load_order,0},
{access_mode,read_write},
{index,[]},
{snmp,[]},
{local_content,false},
{record_name,schema},
{attributes,[table,cstruct]},
{user_properties,[]},
{frag_properties,[]},
{cookie,{{1358,676768,107058},'rds_la_dev@10.232.64.77'}},
{version,{{2,0},[]}}]}]
This forms a {schema, schema, CreateList} tuple, and mnesia_schema:list2cs(CreateList) converts the CreateList back into the schema's cstruct record.
mnesia_bup.erl
restore_recs(Recs, Header, Schema, {start, FA}) ->
%% No records in backup
Schema2 = convert_schema(Header#log_header.log_version, Schema),
CreateList = lookup_schema(schema, Schema2),
case catch mnesia_schema:list2cs(CreateList) of
{'EXIT', Reason} ->
throw({error, {"Bad schema in restore_recs", Reason}});
Cs ->
Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies),
global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity),
Args = [self(), FA],
Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns],
send_fallback(Pids, {start, Header, Schema2}),
Res = restore_recs(Recs, Header, Schema2, Pids),
global:del_lock({{mnesia_table_lock, schema}, self()}, Ns),
Res
end;
get_fallback_nodes returns the fallback nodes of the nodes taking part in the schema construction, normally all participating nodes.
The construction takes the cluster-wide global lock {mnesia_table_lock, schema}.
On every participating node a fallback_receiver process is created to handle the schema change.
The {start, Header, Schema2} message is broadcast to those fallback_receiver processes and their replies are awaited.
Once every node's fallback_receiver has acknowledged the start message, the next stage begins:
restore_recs([], _Header, _Schema, Pids) ->
send_fallback(Pids, swap),
send_fallback(Pids, stop),
stop;
restore_recs then broadcasts the subsequent swap and stop messages to every node's fallback_receiver process, completing the schema change, and finally releases the global lock {mnesia_table_lock, schema}.
Now the fallback_receiver process itself:
fallback_receiver(Master, FA) ->
process_flag(trap_exit, true),
case catch register(mnesia_fallback, self()) of
{'EXIT', _} ->
Reason = {already_exists, node()},
local_fallback_error(Master, Reason);
true ->
FA2 = check_fallback_dir(Master, FA),
Bup = FA2#fallback_args.fallback_bup,
case mnesia_lib:exists(Bup) of
true ->
Reason2 = {already_exists, node()},
local_fallback_error(Master, Reason2);
false ->
Mod = mnesia_backup,
Tmp = FA2#fallback_args.fallback_tmp,
R = #restore{mode = replace,
bup_module = Mod,
bup_data = Tmp},
file:delete(Tmp),
case catch fallback_receiver_loop(Master, R, FA2, schema) of
{error, Reason} ->
local_fallback_error(Master, Reason);
Other ->
exit(Other)
end
end
end.
It registers itself on its own node under the name mnesia_fallback, builds its initial state, and enters fallback_receiver_loop to process messages from the node that initiated the schema change.
fallback_receiver_loop(Master, R, FA, State) ->
receive
{Master, {start, Header, Schema}} when State =:= schema ->
Dir = FA#fallback_args.mnesia_dir,
throw_bad_res(ok, mnesia_schema:opt_create_dir(true, Dir)),
R2 = safe_apply(R, open_write, [R#restore.bup_data]),
R3 = safe_apply(R2, write, [R2#restore.bup_data, [Header]]),
BupSchema = [schema2bup(S) || S <- Schema],
R4 = safe_apply(R3, write, [R3#restore.bup_data, BupSchema]),
Master ! {self(), ok},
fallback_receiver_loop(Master, R4, FA, records);
…
end.
A temporary schema file is also created locally, and the header and new schema built by the initiating node are received into it.
fallback_receiver_loop(Master, R, FA, State) ->
receive
…
{Master, swap} when State =/= schema ->
?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []),
safe_apply(R, commit_write, [R#restore.bup_data]),
Bup = FA#fallback_args.fallback_bup,
Tmp = FA#fallback_args.fallback_tmp,
throw_bad_res(ok, file:rename(Tmp, Bup)),
catch mnesia_lib:set(active_fallback, true),
?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []),
Master ! {self(), ok},
fallback_receiver_loop(Master, R, FA, stop);
…
end.
mnesia_backup.erl
commit_write(OpaqueData) ->
B = OpaqueData,
case disk_log:sync(B#backup.file_desc) of
ok ->
case disk_log:close(B#backup.file_desc) of
ok ->
case file:rename(B#backup.tmp_file, B#backup.file) of
ok ->
{ok, B#backup.file};
{error, Reason} ->
{error, Reason}
end;
{error, Reason} ->
{error, Reason}
end;
{error, Reason} ->
{error, Reason}
end.
The commit step: while being written on this node, the new schema file carries a ".BUPTMP" suffix to mark it as temporary and uncommitted. On commit, the file is synced to disk, closed, and renamed to the real backup file name, dropping the trailing ".BUPTMP".
fallback_receiver_loop(Master, R, FA, State) ->
receive
…
{Master, swap} when State =/= schema ->
?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []),
safe_apply(R, commit_write, [R#restore.bup_data]),
Bup = FA#fallback_args.fallback_bup,
Tmp = FA#fallback_args.fallback_tmp,
throw_bad_res(ok, file:rename(Tmp, Bup)),
catch mnesia_lib:set(active_fallback, true),
?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []),
Master ! {self(), ok},
fallback_receiver_loop(Master, R, FA, stop);
…
end.
On this participating node the new schema file is renamed to "FALLBACK.BUP", and the node's active_fallback flag is set, marking it as an active fallback node.
fallback_receiver_loop(Master, R, FA, State) ->
receive
…
{Master, stop} when State =:= stop ->
stopped;
…
end.
After the stop message is received, the mnesia_fallback process exits.
3. Second half: the work done by mnesia:start/0
When Mnesia starts, the transaction manager mnesia_tm calls mnesia_bup:tm_fallback_start(IgnoreFallback) to build the schema into a dets table:
mnesia_bup.erl
tm_fallback_start(IgnoreFallback) ->
mnesia_schema:lock_schema(),
Res = do_fallback_start(fallback_exists(), IgnoreFallback),
mnesia_schema: unlock_schema(),
case Res of
ok -> ok;
{error, Reason} -> exit(Reason)
end.
The schema table is locked, the schema is recovered and created from the "FALLBACK.BUP" file, and the schema table lock is then released.
do_fallback_start(true, false) ->
verbose("Starting from fallback...~n", []),
BupFile = fallback_bup(),
Mod = mnesia_backup,
LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]),
case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of
…
end.
Based on the "FALLBACK.BUP" file, restore_tables is called to perform the restore:
restore_tables(Recs, Header, Schema, {start, LocalTabs}) ->
Dir = mnesia_lib:dir(),
OldDir = filename:join([Dir, "OLD_DIR"]),
mnesia_schema:purge_dir(OldDir, []),
mnesia_schema:purge_dir(Dir, [fallback_name()]),
init_dat_files(Schema, LocalTabs),
State = {new, LocalTabs},
restore_tables(Recs, Header, Schema, State);
init_dat_files(Schema, LocalTabs) ->
TmpFile = mnesia_lib:tab2tmp(schema),
Args = [{file, TmpFile}, {keypos, 2}, {type, set}],
case dets:open_file(schema, Args) of % Assume schema lock
{ok, _} ->
create_dat_files(Schema, LocalTabs),
ok = dets:close(schema),
LocalTab = #local_tab{
name = schema,
storage_type = disc_copies,
open = undefined,
add = undefined,
close = undefined,
swap = undefined,
record_name = schema,
opened = false},
?ets_insert(LocalTabs, LocalTab);
{error, Reason} ->
throw({error, {"Cannot open file", schema, Args, Reason}})
end.
A dets table for the schema is created under the file name schema.TMP, and the metadata of every table found in "FALLBACK.BUP" is restored into it.
create_dat_files then builds the Open/Add/Close/Swap functions for the other tables' metadata on this node, and these are invoked to persist that metadata into the schema table.
restore_tables(Recs, Header, Schema, {start, LocalTabs}) ->
Dir = mnesia_lib:dir(),
OldDir = filename:join([Dir, "OLD_DIR"]),
mnesia_schema:purge_dir(OldDir, []),
mnesia_schema:purge_dir(Dir, [fallback_name()]),
init_dat_files(Schema, LocalTabs),
State = {new, LocalTabs},
restore_tables(Recs, Header, Schema, State);
Building the Open/Add/Close/Swap functions for the other tables' metadata on this node:
restore_tables(All=[Rec | Recs], Header, Schema, {new, LocalTabs}) ->
Tab = element(1, Rec),
case ?ets_lookup(LocalTabs, Tab) of
[] ->
State = {not_local, LocalTabs, Tab},
restore_tables(Recs, Header, Schema, State);
[LT] when is_record(LT, local_tab) ->
State = {local, LocalTabs, LT},
case LT#local_tab.opened of
true -> ignore;
false ->
(LT#local_tab.open)(Tab, LT),
?ets_insert(LocalTabs,LT#local_tab{opened=true})
end,
restore_tables(All, Header, Schema, State)
end;
The table is opened; each record is checked to see whether its table is local, and if so the restore/add step runs:
restore_tables(All=[Rec | Recs], Header, Schema, State={local, LocalTabs, LT}) ->
Tab = element(1, Rec),
if
Tab =:= LT#local_tab.name ->
Key = element(2, Rec),
(LT#local_tab.add)(Tab, Key, Rec, LT),
restore_tables(Recs, Header, Schema, State);
true ->
NewState = {new, LocalTabs},
restore_tables(All, Header, Schema, NewState)
end;
The Add function mainly writes the table's records into the schema table; at this point they go into the temporary schema and are not yet committed.
Once all tables have been restored, the real commit takes place:
do_fallback_start(true, false) ->
verbose("Starting from fallback...~n", []),
BupFile = fallback_bup(),
Mod = mnesia_backup,
LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]),
case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of
{ok, _Res} ->
catch dets:close(schema),
TmpSchema = mnesia_lib:tab2tmp(schema),
DatSchema = mnesia_lib:tab2dat(schema),
AllLT = ?ets_match_object(LocalTabs, '_'),
?ets_delete_table(LocalTabs),
case file:rename(TmpSchema, DatSchema) of
ok ->
[(LT#local_tab.swap)(LT#local_tab.name, LT) ||
LT <- AllLT, LT#local_tab.name =/= schema],
file:delete(BupFile),
ok;
{error, Reason} ->
file:delete(TmpSchema),
{error, {"Cannot start from fallback. Rename error.", Reason}}
end;
{error, Reason} ->
{error, {"Cannot start from fallback", Reason}};
{'EXIT', Reason} ->
{error, {"Cannot start from fallback", Reason}}
end.
schema.TMP is renamed to schema.DAT, the persistent schema is activated, and the schema table change is committed.
The swap functions created in create_dat_files are then called to commit each table: ram_copies tables need nothing extra; for disc_only_copies tables this mainly commits the file name of the corresponding dets table; for disc_copies tables it mainly records redo log entries and then commits the corresponding dets file name.
When everything is done the schema table has become a persistent dets table, and the "FALLBACK.BUP" file is deleted.
After the transaction manager has finished building the schema dets table, it initializes mnesia_schema:
mnesia_schema.erl
init(IgnoreFallback) ->
Res = read_schema(true, IgnoreFallback),
{ok, Source, _CreateList} = exit_on_error(Res),
verbose("Schema initiated from: ~p~n", [Source]),
set({schema, tables}, []),
set({schema, local_tables}, []),
Tabs = set_schema(?ets_first(schema)),
lists:foreach(fun(Tab) -> clear_whereabouts(Tab) end, Tabs),
set({schema, where_to_read}, node()),
set({schema, load_node}, node()),
set({schema, load_reason}, initial),
mnesia_controller:add_active_replica(schema, node()).
This determines where the schema table was restored from, initializes the schema's basic information in the global state ets table mnesia_gvar, and registers this node as the schema table's first active replica.
If a node is an active replica of a table, the table's where_to_commit and where_to_write properties must both contain that node.
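These whereabouts can be inspected at run time through documented table_info/2 keys; a small sketch (the helper name is illustrative):

%% Where Mnesia currently reads and writes a table, and who holds copies.
replica_view(Tab) ->
    [{where_to_read,  mnesia:table_info(Tab, where_to_read)},
     {where_to_write, mnesia:table_info(Tab, where_to_write)},
     {disc_copies,    mnesia:table_info(Tab, disc_copies)},
     {ram_copies,     mnesia:table_info(Tab, ram_copies)}].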
4. How mnesia:change_table_majority/2 works
An Mnesia table can be given a majority property when it is created, or later via mnesia:change_table_majority/2.
With this property set, Mnesia checks during every transaction that the nodes taking part form a majority of the table's replica nodes. During a network partition this keeps the majority side available while preserving overall consistency; the minority side becomes unusable. This is one of the trade-offs permitted by the CAP theorem.
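For reference, both ways of setting the property; the table definition is illustrative:

%% At table creation time ...
{atomic, ok} = mnesia:create_table(account,
                                   [{disc_copies, [node() | nodes()]},
                                    {attributes, [id, balance]},
                                    {majority, true}]),
%% ... or afterwards, through the schema transaction analyzed below.
{atomic, ok} = mnesia:change_table_majority(account, true).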
1. The calling interface
mnesia.erl
change_table_majority(T, M) ->
mnesia_schema:change_table_majority(T, M).
mnesia_schema.erl
change_table_majority(Tab, Majority) when is_boolean(Majority) ->
schema_transaction(fun() -> do_change_table_majority(Tab, Majority) end).
schema_transaction(Fun) ->
case get(mnesia_activity_state) of
undefined ->
Args = [self(), Fun, whereis(mnesia_controller)],
Pid = spawn_link(?MODULE, schema_coordinator, Args),
receive
{transaction_done, Res, Pid} -> Res;
{'EXIT', Pid, R} -> {aborted, {transaction_crashed, R}}
end;
_ ->
{aborted, nested_transaction}
end.
This spawns a schema_coordinator process to coordinate the schema transaction.
schema_coordinator(Client, Fun, Controller) when is_pid(Controller) ->
link(Controller),
unlink(Client),
Res = mnesia:transaction(Fun),
Client ! {transaction_done, Res, self()},
unlink(Controller), % Avoids spurious exit message
unlink(whereis(mnesia_tm)), % Avoids spurious exit message
exit(normal).
Unlike an ordinary transaction, the schema_coordinator process used by a schema transaction is linked not to the requesting process but to the mnesia_controller process.
It then starts an ordinary Mnesia transaction whose fun is fun() -> do_change_table_majority(Tab, Majority) end.
2. The transaction operation
do_change_table_majority(schema, _Majority) ->
mnesia:abort({bad_type, schema});
do_change_table_majority(Tab, Majority) ->
TidTs = get_tid_ts_and_lock(schema, write),
get_tid_ts_and_lock(Tab, none),
insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).
As the first clause shows, the majority property of the schema table itself cannot be changed.
A write lock is requested explicitly on the schema table, but no lock is taken on the table whose majority property is being changed.
get_tid_ts_and_lock(Tab, Intent) ->
TidTs = get(mnesia_activity_state),
case TidTs of
{_Mod, Tid, Ts} when is_record(Ts, tidstore)->
Store = Ts#tidstore.store,
case Intent of
read -> mnesia_locker:rlock_table(Tid, Store, Tab);
write -> mnesia_locker:wlock_table(Tid, Store, Tab);
none -> ignore
end,
TidTs;
_ ->
mnesia:abort(no_transaction)
end.
The locking path: the table lock is requested directly from the lock manager mnesia_locker.
do_change_table_majority(Tab, Majority) ->
TidTs = get_tid_ts_and_lock(schema, write),
get_tid_ts_and_lock(Tab, none),
insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).
Now the actual modification of the majority property:
make_change_table_majority(Tab, Majority) ->
ensure_writable(schema),
Cs = incr_version(val({Tab, cstruct})),
ensure_active(Cs),
OldMajority = Cs#cstruct.majority,
Cs2 = Cs#cstruct{majority = Majority},
FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of
{_, Tab} ->
FragNames = mnesia_frag:frag_names(Tab) -- [Tab],
lists:map(
fun(T) ->
get_tid_ts_and_lock(Tab, none),
CsT = incr_version(val({T, cstruct})),
ensure_active(CsT),
CsT2 = CsT#cstruct{majority = Majority},
verify_cstruct(CsT2),
{op, change_table_majority, vsn_cs2list(CsT2),
OldMajority, Majority}
end, FragNames);
false -> [];
{_, _} -> mnesia:abort({bad_type, Tab})
end,
verify_cstruct(Cs2),
[{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].
ensure_writable checks whether the schema table's where_to_write property is [], i.e. whether there is any node with a persistent schema.
incr_version bumps the table's version number.
ensure_active checks that all of the table's replica nodes are alive, i.e. it confirms the table's global view with the replica nodes.
Updating the table's metadata version:
incr_version(Cs) ->
{{Major, Minor}, _} = Cs#cstruct.version,
Nodes = mnesia_lib:intersect(val({schema, disc_copies}),
mnesia_lib:cs_to_nodes(Cs)),
V=
case Nodes -- val({Cs#cstruct.name, active_replicas}) of
[] -> {Major + 1, 0}; % All replicas are active
_ -> {Major, Minor + 1} % Some replicas are inactive
end,
Cs#cstruct{version = {V, {node(), now()}}}.
mnesia_lib.erl
cs_to_nodes(Cs) ->
Cs#cstruct.disc_only_copies ++
Cs#cstruct.disc_copies ++
Cs#cstruct.ram_copies.
The table's metadata version is recomputed. Because this is a change to the schema, the computation considers the nodes that hold a persistent schema together with the nodes that hold a replica of the table: if the intersection of these two sets is fully alive, the major version is bumped, otherwise only the minor version is bumped. A new version descriptor is also generated for the table's cstruct, consisting of {new version number, {node initiating the change, time of the change}}, effectively a space-time stamp plus a monotonically increasing sequence. The version calculation is similar to NDB's.
Checking the table's global view:
ensure_active(Cs) ->
ensure_active(Cs, active_replicas).
ensure_active(Cs, What) ->
Tab = Cs#cstruct.name,
W = {Tab, What},
ensure_non_empty(W),
Nodes = mnesia_lib:intersect(val({schema, disc_copies}), mnesia_lib:cs_to_nodes(Cs)),
case Nodes -- val(W) of
[] ->
ok;
Ns ->
Expl = "All replicas on diskfull nodes are not active yet",
case val({Tab, local_content}) of
true ->
case rpc:multicall(Ns, ?MODULE, is_remote_member, [W]) of
{Replies, []} ->
check_active(Replies, Expl, Tab);
{_Replies, BadNs} ->
mnesia:abort({not_active, Expl, Tab, BadNs})
end;
false ->
mnesia:abort({not_active, Expl, Tab, Ns})
end
end.
is_remote_member(Key) ->
IsActive = lists:member(node(), val(Key)),
{IsActive, node()}.
To prevent inconsistent state, any such "unclear" node must be asked for confirmation: a node that is not in the table's active-replica list but is a replica node and holds a persistent schema. The confirmation, via is_remote_member, asks that node whether it already considers itself an active replica of the table. This removes any disagreement between such a node and the requesting node about the table's state.
make_change_table_majority(Tab, Majority) ->
ensure_writable(schema),
Cs = incr_version(val({Tab, cstruct})),
ensure_active(Cs),
OldMajority = Cs#cstruct.majority,
Cs2 = Cs#cstruct{majority = Majority},
FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of
{_, Tab} ->
FragNames = mnesia_frag:frag_names(Tab) -- [Tab],
lists:map(
fun(T) ->
get_tid_ts_and_lock(Tab, none),
CsT = incr_version(val({T, cstruct})),
ensure_active(CsT),
CsT2 = CsT#cstruct{majority = Majority},
verify_cstruct(CsT2),
{op, change_table_majority, vsn_cs2list(CsT2),
OldMajority, Majority}
end, FragNames);
false -> [];
{_, _} -> mnesia:abort({bad_type, Tab})
end,
verify_cstruct(Cs2),
[{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].
The majority field in the table's cstruct is changed, and the new cstruct is verified; the verification mainly checks that each cstruct field has an acceptable type and content.
vsn_cs2list converts the cstruct into a proplist whose keys are the record field names and whose values are the field values.
A change_table_majority operation is produced for insert_schema_ops to consume.
do_change_table_majority(Tab, Majority) ->
TidTs = get_tid_ts_and_lock(schema, write),
get_tid_ts_and_lock(Tab, none),
insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).
At this point make_change_table_majority has produced the operation list [{op, change_table_majority, the table's new cstruct as a proplist, OldMajority, Majority}].
insert_schema_ops({_Mod, _Tid, Ts}, SchemaIOps) ->
do_insert_schema_ops(Ts#tidstore.store, SchemaIOps).
do_insert_schema_ops(Store, [Head | Tail]) ->
?ets_insert(Store, Head),
do_insert_schema_ops(Store, Tail);
do_insert_schema_ops(_Store, []) ->
ok.
As can be seen, the insertion merely records the make_change_table_majority operation in the transaction's temporary ets store.
Once this temporary insert is done, Mnesia starts the commit. Unlike an ordinary table transaction, because the entry starts with op this is a schema transaction: the transaction manager treats it specially and uses a different commit protocol.
3. The schema transaction commit interface
mnesia_tm.erl
t_commit(Type) ->
{_Mod, Tid, Ts} = get(mnesia_activity_state),
Store = Ts#tidstore.store,
if
Ts#tidstore.level == 1 ->
intercept_friends(Tid, Ts),
case arrange(Tid, Store, Type) of
{N, Prep} when N > 0 ->
multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store);
{0, Prep} ->
multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store)
end;
true ->
%% nested commit
Level = Ts#tidstore.level,
[{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores,
req({del_store, Tid, Store, Obsolete, false}),
NewTs = Ts#tidstore{store = Store,
up_stores = Tail,
level = Level - 1},
NewTidTs = {OldMod, Tid, NewTs},
put(mnesia_activity_state, NewTidTs),
do_commit_nested
end.
The check happens first while the operations are being arranged:
arrange(Tid, Store, Type) ->
%% The local node is always included
Nodes = get_elements(nodes,Store),
Recs = prep_recs(Nodes, []),
Key = ?ets_first(Store),
N = 0,
Prep =
case Type of
async -> #prep{protocol = sym_trans, records = Recs};
sync -> #prep{protocol = sync_sym_trans, records = Recs}
end,
case catch do_arrange(Tid, Store, Key, Prep, N) of
{'EXIT', Reason} ->
dbg_out("do_arrange failed ~p ~p~n", [Reason, Tid]),
case Reason of
{aborted, R} ->
mnesia:abort(R);
_ ->
mnesia:abort(Reason)
end;
{New, Prepared} ->
{New, Prepared#prep{records = reverse(Prepared#prep.records)}}
end.
The Key argument is the first entry inserted into the temporary ets store, which here is op.
do_arrange(Tid, Store, {Tab, Key}, Prep, N) ->
Oid = {Tab, Key},
Items = ?ets_lookup(Store, Oid), %% Store is a bag
P2 = prepare_items(Tid, Tab, Key, Items, Prep),
do_arrange(Tid, Store, ?ets_next(Store, Oid), P2, N + 1);
do_arrange(Tid, Store, SchemaKey, Prep, N) when SchemaKey == op ->
Items = ?ets_lookup(Store, SchemaKey), %% Store is a bag
P2 = prepare_schema_items(Tid, Items, Prep),
do_arrange(Tid, Store, ?ets_next(Store, SchemaKey), P2, N + 1);
An ordinary table's key is {Tab, Key}, whereas the schema entry's key is op; the Items fetched are [{op, change_table_majority, the table's new cstruct as a proplist, OldMajority, Majority}], which makes this transaction use a different commit protocol:
prepare_schema_items(Tid, Items, Prep) ->
Types = [{N, schema_ops} || N <- val({current, db_nodes})],
Recs = prepare_nodes(Tid, Types, Items, Prep#prep.records, schema),
Prep#prep{protocol = asym_trans, records = Recs}.
prepare_nodes records the schema operations in the schema_ops field of Recs and sets the table's commit protocol to asym_trans.
prepare_node(_Node, _Storage, Items, Rec, Kind)
when Kind == schema, Rec#commit.schema_ops == [] ->
Rec#commit{schema_ops = Items};
t_commit(Type) ->
{_Mod, Tid, Ts} = get(mnesia_activity_state),
Store = Ts#tidstore.store,
if
Ts#tidstore.level == 1 ->
intercept_friends(Tid, Ts),
case arrange(Tid, Store, Type) of
{N, Prep} when N > 0 ->
multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store);
{0, Prep} ->
multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store)
end;
true ->
%% nested commit
Level = Ts#tidstore.level,
[{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores,
req({del_store, Tid, Store, Obsolete, false}),
NewTs = Ts#tidstore{store = Store,
up_stores = Tail,
level = Level - 1},
NewTidTs = {OldMod, Tid, NewTs},
put(mnesia_activity_state, NewTidTs),
do_commit_nested
end.
The commit uses asym_trans. This protocol is mainly used for schema operations, operations on tables with the majority property, the recover_coordinator path, and restore_op operations.
4. The schema transaction protocol
multi_commit(asym_trans, Majority, Tid, CR, Store) ->
D = #decision{tid = Tid, outcome = presume_abort},
{D2, CR2} = commit_decision(D, CR, [], []),
DiscNs = D2#decision.disc_nodes,
RamNs = D2#decision.ram_nodes,
case have_majority(Majority, DiscNs ++ RamNs) of
ok -> ok;
{error, Tab} -> mnesia:abort({no_majority, Tab})
end,
Pending = mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs),
?ets_insert(Store, Pending),
{WaitFor, Local} = ask_commit(asym_trans, Tid, CR2, DiscNs, RamNs),
SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})),
{Votes, Pids} = rec_all(WaitFor, Tid, do_commit, []),
?eval_debug_fun({?MODULE, multi_commit_asym_got_votes}, [{tid, Tid}, {votes, Votes}]),
case Votes of
do_commit ->
case SchemaPrep of
{_Modified, C = #commit{}, DumperMode} ->
mnesia_log:log(C), % C is not a binary
?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_rec}, [{tid, Tid}]),
D3 = C#commit.decision,
D4 = D3#decision{outcome = unclear},
mnesia_recover:log_decision(D4),
?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_dec}, [{tid, Tid}]),
tell_participants(Pids, {Tid, pre_commit}),
rec_acc_pre_commit(Pids, Tid, Store, {C,Local}, do_commit, DumperMode, [], []);
{'EXIT', Reason} ->
mnesia_recover:note_decision(Tid, aborted),
?eval_debug_fun({?MODULE, multi_commit_asym_prepare_exit}, [{tid, Tid}]),
tell_participants(Pids, {Tid, {do_abort, Reason}}),
do_abort(Tid, Local),
{do_abort, Reason}
end;
{do_abort, Reason} ->
mnesia_recover:note_decision(Tid, aborted),
?eval_debug_fun({?MODULE, multi_commit_asym_do_abort}, [{tid, Tid}]),
tell_participants(Pids, {Tid, {do_abort, Reason}}),
do_abort(Tid, Local),
{do_abort, Reason}
end.
The commit starts in mnesia_tm:t_commit/1 and proceeds as follows (a sketch of the majority condition in step 1 follows this list):
1. the initiating node checks the majority condition: the number of live replica nodes must be greater than half of the table's disc plus ram replica nodes; exactly half is not enough
2. the initiating node calls mnesia_checkpoint:tm_enter_pending to create a pending checkpoint entry
3. the initiating node starts phase one by sending ask_commit to the transaction manager on every participating node; note that the protocol type is asym_trans
4. each participating transaction manager spawns a commit_participant process, which is responsible for the rest of the commit on that node
(note that majority and schema operations therefore need an extra helper process for the commit, which can reduce performance)
5. the participant's commit_participant process runs the local prepare step for the schema operations; for change_table_majority there is nothing to prepare
6. the commit_participant agrees to commit and returns vote_yes to the initiator
7. the initiator collects the agreement from all participants
8. the initiator runs its own local prepare step for the schema operations; again, for change_table_majority there is nothing to prepare
9. having received vote_yes from all participants, the initiator logs the operations to be committed
10. the initiator writes the phase-one recovery log entry presume_abort
11. the initiator writes the phase-two recovery log entry unclear
12. the initiator starts phase two by sending pre_commit to every participant's commit_participant process
13. each commit_participant receives pre_commit and pre-commits
14. the participant writes the phase-one recovery log entry presume_abort
15. the participant writes the phase-two recovery log entry unclear
16. the commit_participant acknowledges the pre-commit by returning acc_pre_commit to the initiator
17. after collecting acc_pre_commit from all participants, the initiator records which schema-operation participants it must wait for, for use in crash recovery
18. the initiator starts phase three by sending committed to every participant's commit_participant process
19. a. right after notifying the participants, the initiator writes the phase-two recovery log entry committed
    b. each commit_participant, on receiving committed, commits and writes the phase-two recovery log entry committed
20. a. after writing that log entry, the initiator commits locally via do_commit
    b. after writing that log entry, each commit_participant commits locally via do_commit
21. a. after its local commit, if there are schema operations, the initiator waits synchronously for the participants' schema-operation commit results
    b. after its local commit, if there are schema operations, each commit_participant returns schema_commit to the initiator
22. a. once all schema_commit replies have arrived, the initiator releases its locks and transaction resources
    b. each commit_participant releases its locks and transaction resources
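The majority condition in step 1 can be written down compactly. The sketch below illustrates the rule only; it is not the actual mnesia_tm:have_majority/2 code, which also walks the per-table commit records:

%% A commit may proceed only when strictly more than half of the table's
%% replica nodes take part in it; exactly half is not enough.
has_majority(AllCopyHolders, Participants) ->
    Needed = length(AllCopyHolders) div 2 + 1,
    InCommit = [N || N <- AllCopyHolders, lists:member(N, Participants)],
    length(InCommit) >= Needed.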
5. Remote transaction manager: the phase-one prepare response
When a participating node's transaction manager receives the phase-one commit message:
mnesia_tm.erl
doit_loop(#state{coordinators=Coordinators,participants=Participants,supervisor=Sup}=State) ->
…
{From, {ask_commit, Protocol, Tid, Commit, DiscNs, RamNs}} ->
?eval_debug_fun({?MODULE, doit_ask_commit},
[{tid, Tid}, {prot, Protocol}]),
mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs),
Pid =
case Protocol of
asym_trans when node(Tid#tid.pid) /= node() ->
Args = [tmpid(From), Tid, Commit, DiscNs, RamNs],
spawn_link(?MODULE, commit_participant, Args);
_ when node(Tid#tid.pid) /= node() -> %% *_sym_trans
reply(From, {vote_yes, Tid}),
nopid
end,
P = #participant{tid = Tid,
pid = Pid,
commit = Commit,
disc_nodes = DiscNs,
ram_nodes = RamNs,
protocol = Protocol},
State2 = State#state{participants = gb_trees:insert(Tid,P,Participants)},
doit_loop(State2);
…
A commit_participant process is spawned with the arguments [initiating node's pid, transaction id, commit record, disc node list, ram node list] to carry the commit through:
commit_participant(Coord, Tid, Bin, DiscNs, RamNs) when is_binary(Bin) ->
process_flag(trap_exit, true),
Commit = binary_to_term(Bin),
commit_participant(Coord, Tid, Bin, Commit, DiscNs, RamNs);
commit_participant(Coord, Tid, C = #commit{}, DiscNs, RamNs) ->
process_flag(trap_exit, true),
commit_participant(Coord, Tid, C, C, DiscNs, RamNs).
commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]),
case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of
{Modified, C = #commit{}, DumperMode} ->、
case lists:member(node(), DiscNs) of
false ->
ignore;
true ->
case Modified of
false -> mnesia_log:log(Bin);
true -> mnesia_log:log(C)
end
end,
?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]),
reply(Coord, {vote_yes, Tid, self()}),
…
Right after being created, the participant's commit_participant process has to run the local prepare step for the schema table:
mnesia_schema.erl
prepare_commit(Tid, Commit, WaitFor) ->
case Commit#commit.schema_ops of
[] ->
{false, Commit, optional};
OrigOps ->
{Modified, Ops, DumperMode} =
prepare_ops(Tid, OrigOps, WaitFor, false, [], optional),
InitBy = schema_prepare,
GoodRes = {Modified,
Commit#commit{schema_ops = lists:reverse(Ops)}, DumperMode},
case DumperMode of
optional ->
dbg_out("Transaction log dump skipped (~p): ~w~n", [DumperMode, InitBy]);
mandatory ->
case mnesia_controller:sync_dump_log(InitBy) of
dumped -> GoodRes;
{error, Reason} -> mnesia:abort(Reason)
end
end,
case Ops of
[] -> ignore;
_ -> mnesia_controller:wait_for_schema_commit_lock()
end,
GoodRes
end.
Note the three main branches here:
1. if the operations contain no schema operation at all, nothing is done and {false, the original Commit, optional} is returned; this is the case for operations on majority tables
2. if prepare_ops finds any of these operations: rec, announce_im_running, sync_trans, create_table, delete_table, add_table_copy, del_table_copy, change_table_copy_type, dump_table, add_snmp, transform, merge_schema, then a prepare step may (but does not always) have to run; the prepare work records operation-specific state and syncs the log; this applies whenever those operations are present
3. if prepare_ops finds only other kinds of operations, nothing is done and {true, the original Commit, optional} is returned; this covers the lighter schema operations, and change_table_majority belongs to this category
6. Remote transaction participant: the phase-two precommit response
commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
…
receive
{Tid, pre_commit} ->
D = C#commit.decision,
mnesia_recover:log_decision(D#decision{outcome = unclear}),
?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]),
Expect_schema_ack = C#commit.schema_ops /= [],
reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}),
receive
{Tid, committed} ->
mnesia_recover:log_decision(D#decision{outcome = committed}),
?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]),
do_commit(Tid, C, DumperMode),
case Expect_schema_ack of
false -> ignore;
true -> reply(Coord, {schema_commit, Tid, self()})
end,
?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]);
…
end;
…
On receiving the pre-commit message, the participant's commit_participant process likewise writes the phase-two recovery log entry unclear and replies with acc_pre_commit.
7. Requesting node: receiving the phase-two precommit acknowledgements
After the initiating node has received acc_pre_commit from every participant:
rec_acc_pre_commit([], Tid, Store, {Commit,OrigC}, Res, DumperMode, GoodPids,
SchemaAckPids) ->
D = Commit#commit.decision,
case Res of
do_commit ->
prepare_sync_schema_commit(Store, SchemaAckPids),
tell_participants(GoodPids, {Tid, committed}),
D2 = D#decision{outcome = committed},
mnesia_recover:log_decision(D2),
?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_commit}, [{tid, Tid}]),
do_commit(Tid, Commit, DumperMode),
?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_commit}, [{tid, Tid}]),
sync_schema_commit(Tid, Store, SchemaAckPids),
mnesia_locker:release_tid(Tid),
?MODULE ! {delete_transaction, Tid};
{do_abort, Reason} ->
tell_participants(GoodPids, {Tid, {do_abort, Reason}}),
D2 = D#decision{outcome = aborted},
mnesia_recover:log_decision(D2),
?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_abort}, [{tid, Tid}]),
do_abort(Tid, OrigC),
?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_abort}, [{tid, Tid}])
end,
Res.
prepare_sync_schema_commit(_Store, []) ->
ok;
prepare_sync_schema_commit(Store, [Pid | Pids]) ->
?ets_insert(Store, {waiting_for_commit_ack, node(Pid)}),
prepare_sync_schema_commit(Store, Pids).
The initiator records locally which nodes take part in the schema operation, for crash recovery, and then sends committed to every participant's commit_participant process to trigger the final commit. The initiator can now commit locally: it writes the phase-two recovery log entry committed and commits via do_commit, then waits synchronously for the participants' schema-operation commit results. Without schema operations it could return immediately; here it has to wait:
sync_schema_commit(_Tid, _Store, []) ->
ok;
sync_schema_commit(Tid, Store, [Pid | Tail]) ->
receive
{?MODULE, _, {schema_commit, Tid, Pid}} ->
?ets_match_delete(Store, {waiting_for_commit_ack, node(Pid)}),
sync_schema_commit(Tid, Store, Tail);
{mnesia_down, Node} when Node == node(Pid) ->
?ets_match_delete(Store, {waiting_for_commit_ack, Node}),
sync_schema_commit(Tid, Store, Tail)
end.
8. Remote transaction participant: the phase-three commit response
When the participant's commit_participant process receives the commit message:
commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
…
receive
{Tid, pre_commit} ->
D = C#commit.decision,
mnesia_recover:log_decision(D#decision{outcome = unclear}),
?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]),
Expect_schema_ack = C#commit.schema_ops /= [],
reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}),
receive
{Tid, committed} ->
mnesia_recover:log_decision(D#decision{outcome = committed}),
?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]),
do_commit(Tid, C, DumperMode),
case Expect_schema_ack of
false -> ignore;
true -> reply(Coord, {schema_commit, Tid, self()})
end,
?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]);
…
end;
…
On receiving the committed message, the participant's commit_participant process likewise writes the phase-two recovery log entry committed and commits locally via do_commit; if there are schema operations it returns schema_commit to the initiator, otherwise the transaction is finished.
9. Local commit during the phase-three commit
do_commit(Tid, C, DumperMode) ->
mnesia_dumper:update(Tid, C#commit.schema_ops, DumperMode),
R = do_snmp(Tid, C#commit.snmp),
R2 = do_update(Tid, ram_copies, C#commit.ram_copies, R),
R3 = do_update(Tid, disc_copies, C#commit.disc_copies, R2),
R4 = do_update(Tid, disc_only_copies, C#commit.disc_only_copies, R3),
mnesia_subscr:report_activity(Tid),
R4.
We only look at the updates made to the schema table here; note that these updates happen on the initiating node and on the participating nodes alike.
Updating the schema table involves:
1. in the global variable ets table mnesia_gvar, recording the table's majority property with the new value and updating the table's where_to_wlock property
2. in mnesia_gvar, recording the table's cstruct together with all properties derived from it
3. recording the table's cstruct in the schema ets table
4. recording the table's cstruct in the schema dets table
The update proceeds as follows:
mnesia_dumper.erl
update(_Tid, [], _DumperMode) ->
dumped;
update(Tid, SchemaOps, DumperMode) ->
UseDir = mnesia_monitor:use_dir(),
Res = perform_update(Tid, SchemaOps, DumperMode, UseDir),
mnesia_controller:release_schema_commit_lock(),
Res.
perform_update(_Tid, _SchemaOps, mandatory, true) ->
InitBy = schema_update,
?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]),
opt_dump_log(InitBy);
perform_update(Tid, SchemaOps, _DumperMode, _UseDir) ->
InitBy = fast_schema_update,
InPlace = mnesia_monitor:get_env(dump_log_update_in_place),
?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]),
case catch insert_ops(Tid, schema_ops, SchemaOps, InPlace, InitBy,
mnesia_log:version()) of
{'EXIT', Reason} ->
Error = {error, {"Schema update error", Reason}},
close_files(InPlace, Error, InitBy),
fatal("Schema update error ~p ~p", [Reason, SchemaOps]);
_ ->
?eval_debug_fun({?MODULE, post_dump}, [InitBy]),
close_files(InPlace, ok, InitBy),
ok
end.
insert_ops(_Tid, _Storage, [], _InPlace, _InitBy, _) -> ok;
insert_ops(Tid, Storage, [Op], InPlace, InitBy, Ver) when Ver >= "4.3"->
insert_op(Tid, Storage, Op, InPlace, InitBy),
ok;
insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver >= "4.3"->
insert_op(Tid, Storage, Op, InPlace, InitBy),
insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver);
insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver < "4.3" ->
insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver),
insert_op(Tid, Storage, Op, InPlace, InitBy).
…
insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) ->
Cs = mnesia_schema:list2cs(TabDef),
case InitBy of
startup -> ignore;
_ -> mnesia_controller:change_table_majority(Cs)
end,
insert_cstruct(Tid, Cs, true, InPlace, InitBy);
…
A change_table_majority operation has the form
{op, change_table_majority, the table's new cstruct as a proplist, OldMajority, Majority}
Here the proplist form of the cstruct is converted back into the cstruct record, and then the real update is made.
mnesia_controller.erl
change_table_majority(Cs) ->
W = fun() ->
Tab = Cs#cstruct.name,
set({Tab, majority}, Cs#cstruct.majority),
update_where_to_wlock(Tab)
end,
update(W).
update_where_to_wlock(Tab) ->
WNodes = val({Tab, where_to_write}),
Majority = case catch val({Tab, majority}) of
true -> true;
_ -> false
end,
set({Tab, where_to_wlock}, {WNodes, Majority}).
The update made here: in the global variable ets table mnesia_gvar, the table's majority property is set to the new value, and the table's where_to_wlock property is rebuilt with the new majority flag.
mnesia_dumper.erl
…
insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) ->
Cs = mnesia_schema:list2cs(TabDef),
case InitBy of
startup -> ignore;
_ -> mnesia_controller:change_table_majority(Cs)
end,
insert_cstruct(Tid, Cs, true, InPlace, InitBy);
…
insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) ->
Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts),
{schema, Tab, _} = Val,
S = val({schema, storage_type}),
disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy),
Tab.
Besides updating the table's where_to_wlock property in mnesia_gvar, its cstruct and all properties derived from it must be updated, and the cstruct recorded in the schema ets table must be updated as well.
mnesia_schema.erl
insert_cstruct(Tid, Cs, KeepWhereabouts) ->
Tab = Cs#cstruct.name,
TabDef = cs2list(Cs),
Val = {schema, Tab, TabDef},
mnesia_checkpoint:tm_retain(Tid, schema, Tab, write),
mnesia_subscr:report_table_event(schema, Tid, Val, write),
Active = val({Tab, active_replicas}),
case KeepWhereabouts of
true -> ignore;
false when Active == [] -> clear_whereabouts(Tab);
false -> ignore
end,
set({Tab, cstruct}, Cs),
?ets_insert(schema, Val),
do_set_schema(Tab, Cs),
Val.
do_set_schema(Tab) ->
List = get_create_list(Tab),
Cs = list2cs(List),
do_set_schema(Tab, Cs).
do_set_schema(Tab, Cs) ->
Type = Cs#cstruct.type,
set({Tab, setorbag}, Type),
set({Tab, local_content}, Cs#cstruct.local_content),
set({Tab, ram_copies}, Cs#cstruct.ram_copies),
set({Tab, disc_copies}, Cs#cstruct.disc_copies),
set({Tab, disc_only_copies}, Cs#cstruct.disc_only_copies),
set({Tab, load_order}, Cs#cstruct.load_order),
set({Tab, access_mode}, Cs#cstruct.access_mode),
set({Tab, majority}, Cs#cstruct.majority),
set({Tab, all_nodes}, mnesia_lib:cs_to_nodes(Cs)),
set({Tab, snmp}, Cs#cstruct.snmp),
set({Tab, user_properties}, Cs#cstruct.user_properties),
[set({Tab, user_property, element(1, P)}, P) || P <- Cs#cstruct.user_properties],
set({Tab, frag_properties}, Cs#cstruct.frag_properties),
mnesia_frag:set_frag_hash(Tab, Cs#cstruct.frag_properties),
set({Tab, storage_properties}, Cs#cstruct.storage_properties),
set({Tab, attributes}, Cs#cstruct.attributes),
Arity = length(Cs#cstruct.attributes) + 1,
set({Tab, arity}, Arity),
RecName = Cs#cstruct.record_name,
set({Tab, record_name}, RecName),
set({Tab, record_validation}, {RecName, Arity, Type}),
set({Tab, wild_pattern}, wild(RecName, Arity)),
set({Tab, index}, Cs#cstruct.index),
%% create actual index tabs later
set({Tab, cookie}, Cs#cstruct.cookie),
set({Tab, version}, Cs#cstruct.version),
set({Tab, cstruct}, Cs),
Storage = mnesia_lib:schema_cs_to_storage_type(node(), Cs),
set({Tab, storage_type}, Storage),
mnesia_lib:add({schema, tables}, Tab),
Ns = mnesia_lib:cs_to_nodes(Cs),
case lists:member(node(), Ns) of
true ->
mnesia_lib:add({schema, local_tables}, Tab);
false when Tab == schema ->
mnesia_lib:add({schema, local_tables}, Tab);
false ->
ignore
end.
do_set_schema refreshes every property derived from the cstruct, such as the version, the cookie, and so on.
mnesia_dumper.erl
insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) ->
Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts),
{schema, Tab, _} = Val,
S = val({schema, storage_type}),
disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy),
Tab.
disc_insert(_Tid, Storage, Tab, Key, Val, Op, InPlace, InitBy) ->
case open_files(Tab, Storage, InPlace, InitBy) of
true ->
case Storage of
disc_copies when Tab /= schema ->
mnesia_log:append({?MODULE,Tab}, {{Tab, Key}, Val, Op}),
ok;
_ ->
dets_insert(Op,Tab,Key,Val)
end;
false ->
ignore
end.
dets_insert(Op,Tab,Key,Val) ->
case Op of
write ->
dets_updated(Tab,Key),
ok = dets:insert(Tab, Val);
…
end.
dets_updated(Tab,Key) ->
case get(mnesia_dumper_dets) of
undefined ->
Empty = gb_trees:empty(),
Tree = gb_trees:insert(Tab, gb_sets:singleton(Key), Empty),
put(mnesia_dumper_dets, Tree);
Tree ->
case gb_trees:lookup(Tab,Tree) of
{value, cleared} -> ignore;
{value, Set} ->
T = gb_trees:update(Tab, gb_sets:add(Key, Set), Tree),
put(mnesia_dumper_dets, T);
none ->
T = gb_trees:insert(Tab, gb_sets:singleton(Key), Tree),
put(mnesia_dumper_dets, T)
end
end.
This updates the table's cstruct as recorded in the schema dets table.
In summary, for changes to the schema table, and for tables with the majority property, the commit is a three-phase protocol with good crash-recovery checks.
A schema change updates several places (see the sketch after this list):
1. in the global variable ets table mnesia_gvar, the changed table attribute is recorded with its new value
2. in mnesia_gvar, the table's cstruct and all properties derived from it are recorded
3. the table's cstruct is recorded in the schema ets table
4. the table's cstruct is recorded in the schema dets table
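The effect of the commit can then be checked from any node. A small sketch that assumes mnesia:table_info(Tab, all) returns the cstruct as a property list containing the majority key (true on versions that support the property):

%% true | false, or undefined if the property is not present.
majority_of(Tab) ->
    case lists:keyfind(majority, 1, mnesia:table_info(Tab, all)) of
        {majority, Bool} -> Bool;
        false -> undefined
    end.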
5. Majority transaction handling
A majority transaction follows essentially the same path as a schema transaction, except that during the commit in mnesia_tm:multi_commit it does not invoke mnesia_schema:prepare_commit/3 or mnesia_tm:prepare_sync_schema_commit/2 to modify the schema table, nor does it call mnesia_tm:sync_schema_commit to wait for the third, synchronous, commit phase to complete.
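From the application's point of view a majority transaction is simply an ordinary transaction against a majority table; on a node that sits in a minority partition the commit aborts. A minimal sketch (record and table names are illustrative):

-record(account, {id, balance}).

deposit(Id, Amount) ->
    F = fun() ->
                [A] = mnesia:read(account, Id, write),
                mnesia:write(A#account{balance = A#account.balance + Amount})
        end,
    case mnesia:transaction(F) of
        {atomic, ok} -> ok;
        {aborted, {no_majority, _Tab}} -> {error, minority_partition};
        {aborted, Reason} -> {error, Reason}
    end.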
6. Recovery
Mnesia's connection negotiation runs at startup and lets the nodes exchange state information.
The whole negotiation consists of (a sketch for observing the result follows this list):
1. node discovery and cluster traversal
2. protocol version check
3. schema merging
4. decision announcement and merging
5. reloading and merging of table data
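After the negotiation finishes, the resulting cluster view can be observed with documented system_info/1 keys; a small sketch:

%% Which nodes this node currently counts as part of the (running) cluster.
cluster_view() ->
    [{db_nodes,         mnesia:system_info(db_nodes)},
     {running_db_nodes, mnesia:system_info(running_db_nodes)},
     {extra_db_nodes,   mnesia:system_info(extra_db_nodes)}].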
1. Protocol version check + decision announcement and merging
mnesia_recover.erl
connect_nodes(Ns) ->
%% Ns is the list of nodes to check
call({connect_nodes, Ns}).
handle_call({connect_nodes, Ns}, From, State) ->
%% Determine which nodes we should try to connect
AlreadyConnected = val(recover_nodes),
{_, Nodes} = mnesia_lib:search_delete(node(), Ns),
Check = Nodes -- AlreadyConnected,
%% start protocol version negotiation
case mnesia_monitor:negotiate_protocol(Check) of
busy ->
%% monitor is disconnecting some nodes retry
%% the req (to avoid deadlock).
erlang:send_after(2, self(), {connect_nodes,Ns,From}),
{noreply, State};
[] ->
%% No good noodes to connect to!
%% We can't use reply here because this function can be
%% called from handle_info
gen_server:reply(From, {[], AlreadyConnected}),
{noreply, State};
GoodNodes ->
%% GoodNodes are the nodes that passed negotiation
%% Now we have agreed upon a protocol with some new nodes
%% and we may use them when we recover transactions
mnesia_lib:add_list(recover_nodes, GoodNodes),
%% once protocol negotiation has succeeded, announce this node's past transaction decisions to them
cast({announce_all, GoodNodes}),
case get_master_nodes(schema) of
[] ->
Context = starting_partitioned_network,
%% check whether a partition ever occurred between this node and these nodes
mnesia_monitor:detect_inconcistency(GoodNodes, Context);
_ -> %% If master_nodes is set ignore old inconsistencies
ignore
end,
gen_server:reply(From, {GoodNodes, AlreadyConnected}),
{noreply,State}
end;
handle_cast({announce_all, Nodes}, State) ->
announce_all(Nodes),
{noreply, State};
announce_all([]) ->
ok;
announce_all(ToNodes) ->
Tid = trans_tid_serial(),
announce(ToNodes, [{trans_tid,serial,Tid}], [], false).
announce(ToNodes, [Head | Tail], Acc, ForceSend) ->
Acc2 = arrange(ToNodes, Head, Acc, ForceSend),
announce(ToNodes, Tail, Acc2, ForceSend);
announce(_ToNodes, [], Acc, _ForceSend) ->
send_decisions(Acc).
send_decisions([{Node, Decisions} | Tail]) ->
%% note: the decision merge here is an asynchronous process
abcast([Node], {decisions, node(), Decisions}),
send_decisions(Tail);
send_decisions([]) ->
ok.
Every node that passed negotiation is told about this node's historical transaction decisions.
The following code runs on the remote node; below, the remote node is called the receiving node and this node the sending node.
handle_cast({decisions, Node, Decisions}, State) ->
mnesia_lib:add(recover_nodes, Node),
State2 = add_remote_decisions(Node, Decisions, State),
{noreply, State2};
When the receiving node's mnesia_recover gets these broadcast decisions, it compares and merges them.
There are several kinds of decision; the ones used for transaction commits are the decision and transient_decision records.
add_remote_decisions(Node, [D | Tail], State) when is_record(D, decision) ->
State2 = add_remote_decision(Node, D, State),
add_remote_decisions(Node, Tail, State2);
add_remote_decisions(Node, [C | Tail], State)
when is_record(C, transient_decision) ->
D = #decision{tid = C#transient_decision.tid,
outcome = C#transient_decision.outcome,
disc_nodes = [],
ram_nodes = []},
State2 = add_remote_decision(Node, D, State),
add_remote_decisions(Node, Tail, State2);
add_remote_decisions(Node, [{mnesia_down, _, _, _} | Tail], State) ->
add_remote_decisions(Node, Tail, State);
add_remote_decisions(Node, [{trans_tid, serial, Serial} | Tail], State) ->
%% for pending (unclear) transactions reported by the sender, the receiver must keep asking other nodes
sync_trans_tid_serial(Serial),
case State#state.unclear_decision of
undefined -> ignored;
D ->
case lists:member(Node, D#decision.ram_nodes) of
true -> ignore;
false ->
%% if the sender of an unclear decision is not a RAM replica node, the receiver
%% asks it for the actual outcome of that transaction
abcast([Node], {what_decision, node(), D})
end
end,
add_remote_decisions(Node, Tail, State);
add_remote_decisions(_Node, [], State) ->
State.
add_remote_decision(Node, NewD, State) ->
Tid = NewD#decision.tid,
OldD = decision(Tid),
%% merge decisions according to the merge policy; in the single conflicting case, where the
%% receiver committed but the sender aborted, the receiver also aborts, and the table state
%% is later rebuilt from checkpoints and the redo log
D = merge_decisions(Node, OldD, NewD),
%%记录合并结果
do_log_decision(D, false, undefined),
Outcome = D#decision.outcome,
if
OldD == no_decision -> ignore;
Outcome == unclear -> ignore;
true ->
case lists:member(node(), NewD#decision.disc_nodes) or
lists:member(node(), NewD#decision.ram_nodes) of
true ->
%% tell the other nodes the result of this node's decision merge
tell_im_certain([Node], D);
false -> ignore
end
end,
case State#state.unclear_decision of
U when U#decision.tid == Tid ->
WaitFor = State#state.unclear_waitfor -- [Node],
if
Outcome == unclear, WaitFor == [] ->
%% Everybody are uncertain, lets abort
%% every participant of the pending transaction has been asked and none could supply
%% the commit outcome, so decide to abort the transaction
NewOutcome = aborted,
CertainD = D#decision{outcome = NewOutcome,
disc_nodes = [],
ram_nodes = []},
tell_im_certain(D#decision.disc_nodes, CertainD),
tell_im_certain(D#decision.ram_nodes, CertainD),
do_log_decision(CertainD, false, undefined),
verbose("Decided to abort transaction ~p "
"since everybody are uncertain ~p~n",
[Tid, CertainD]),
gen_server:reply(State#state.unclear_pid, {ok, NewOutcome}),
State#state{unclear_pid = undefined,
unclear_decision = undefined,
unclear_waitfor = undefined};
Outcome /= unclear ->
%% the sending node knows the outcome; announce it
verbose("~p told us that transaction ~p was ~p~n",
[Node, Tid, Outcome]),
gen_server:reply(State#state.unclear_pid, {ok, Outcome}),
State#state{unclear_pid = undefined,
unclear_decision = undefined,
unclear_waitfor = undefined};
Outcome == unclear ->
%% the sending node does not know the outcome either; keep waiting
State#state{unclear_waitfor = WaitFor}
end;
_ ->
State
end.
The merge policy:
merge_decisions(Node, D, NewD0) ->
NewD = filter_aborted(NewD0),
if
D == no_decision, node() /= Node ->
%% We did not know anything about this txn
NewD#decision{disc_nodes = []};
D == no_decision ->
NewD;
is_record(D, decision) ->
DiscNs = D#decision.disc_nodes -- ([node(), Node]),
OldD = filter_aborted(D#decision{disc_nodes = DiscNs}),
if
OldD#decision.outcome == unclear,
NewD#decision.outcome == unclear ->
D;
OldD#decision.outcome == NewD#decision.outcome ->
%% We have come to the same decision
OldD;
OldD#decision.outcome == committed,
NewD#decision.outcome == aborted ->
%% The only case where the sending and receiving nodes' decisions conflict: the receiving node
%% committed while the sending node aborted; abort is chosen here as well
Msg = {inconsistent_database, bad_decision, Node},
mnesia_lib:report_system_event(Msg),
OldD#decision{outcome = aborted};
OldD#decision.outcome == aborted -> OldD#decision{outcome = aborted};
NewD#decision.outcome == aborted -> OldD#decision{outcome = aborted};
OldD#decision.outcome == committed,
NewD#decision.outcome == unclear -> OldD#decision{outcome = committed};
OldD#decision.outcome == unclear,
NewD#decision.outcome == committed -> OldD#decision{outcome = committed}
end
end.
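The inconsistent_database system event reported in the conflicting branch above is also the hook through which applications normally learn that a partition has produced contradictory decisions. A minimal sketch, assuming mnesia is already started on the node; the event name and tuple shape follow the documented mnesia system events:
%% Subscribe to mnesia system events and wait for a split-brain report.
%% Context is an atom such as bad_decision (as in merge_decisions above) or
%% running_partitioned_network; Node is the peer the conflict was detected with.
wait_for_inconsistency() ->
    {ok, _ThisNode} = mnesia:subscribe(system),
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            {Context, Node}
    end.
Resolving such a conflict is outside mnesia's automatic recovery; typical remedies are restarting one side from a chosen master (mnesia:set_master_nodes/1) or restoring from a backup.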
2. Node discovery and cluster traversal
mnesia_controller.erl
merge_schema() ->
AllNodes = mnesia_lib:all_nodes(),
%% Try to merge the schema; once merged, notify all former cluster nodes so that data transfer with this node can start
case try_merge_schema(AllNodes, [node()], fun default_merge/1) of
ok ->
%% After the schema merge succeeds, data merging follows
schema_is_merged();
{aborted, {throw, Str}} when is_list(Str) ->
fatal("Failed to merge schema: ~s~n", [Str]);
Else ->
fatal("Failed to merge schema: ~p~n", [Else])
end.
try_merge_schema(Nodes, Told0, UserFun) ->
%% Start the cluster traversal: launch a schema merge transaction
case mnesia_schema:merge_schema(UserFun) of
{atomic, not_merged} ->
%% No more nodes that we need to merge the schema with
%% Ensure we have told everybody that we are running
case val({current,db_nodes}) -- mnesia_lib:uniq(Told0) of
[] -> ok;
Tell ->
im_running(Tell, [node()]),
ok
end;
{atomic, {merged, OldFriends, NewFriends}} ->
%% Check if new nodes has been added to the schema
Diff = mnesia_lib:all_nodes() -- [node() | Nodes],
mnesia_recover:connect_nodes(Diff),
%% Tell everybody to adopt orphan tables
%% Notify all cluster nodes that this node has started, and request data merging
im_running(OldFriends, NewFriends),
im_running(NewFriends, OldFriends),
Told = case lists:member(node(), NewFriends) of
true -> Told0 ++ OldFriends;
false -> Told0 ++ NewFriends
end,
try_merge_schema(Nodes, Told, UserFun);
{atomic, {"Cannot get cstructs", Node, Reason}} ->
dbg_out("Cannot get cstructs, Node ~p ~p~n", [Node, Reason]),
timer:sleep(300), % Avoid a endless loop look alike
try_merge_schema(Nodes, Told0, UserFun);
{aborted, {shutdown, _}} -> %% One of the nodes is going down
timer:sleep(300), % Avoid a endless loop look alike
try_merge_schema(Nodes, Told0, UserFun);
Other ->
Other
end.
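After this traversal loop returns, whether the two partitions really converged can be checked with documented calls from the shell. A small sketch (ExpectedNodes is an example parameter):
%% db_nodes is the node set recorded in the schema, running_db_nodes the currently
%% connected subset; any expected node missing from running_db_nodes was not reached
%% by the merge.
check_merge(ExpectedNodes) ->
    Running = mnesia:system_info(running_db_nodes),
    All     = mnesia:system_info(db_nodes),
    {ExpectedNodes -- Running, ExpectedNodes -- All}.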
mnesia_schema.erl
merge_schema() ->
schema_transaction(fun() -> do_merge_schema([]) end).
merge_schema(UserFun) ->
schema_transaction(fun() -> UserFun(fun(Arg) -> do_merge_schema(Arg) end) end).
As can be seen, merge_schema also runs inside a mnesia schema (metadata) transaction, whose main operations include:
{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}
{op, merge_schema, CstructList}
This process negotiates the schema with the transactional nodes of the cluster and checks whether the schemas are compatible.
do_merge_schema(LockTabs0) ->
%% Lock the schema table
{_Mod, Tid, Ts} = get_tid_ts_and_lock(schema, write),
LockTabs = [{T, tab_to_nodes(T)} || T <- LockTabs0],
[get_tid_ts_and_lock(T,write) || {T,_} <- LockTabs],
Connected = val(recover_nodes),
Running = val({current, db_nodes}),
Store = Ts#tidstore.store,
%% Verify that all nodes are locked that might not be the
%% case, if this trans where queued when new nodes where added.
case Running -- ets:lookup_element(Store, nodes, 2) of
[] -> ok; %% All known nodes are locked
Miss -> %% Abort! We don't want the sideeffects below to be executed
mnesia:abort({bad_commit, {missing_lock, Miss}})
end,
%% Connected is the set of nodes this node is already connected to, normally the nodes of the
%% current cluster with a compatible communication protocol; Running is this node's current
%% db_nodes, normally the cluster nodes that agree with this node
case Connected -- Running of
%% For nodes that are connected but have not yet exchanged decisions, the communication protocol
%% is negotiated first, followed by the decisions. This is in effect a node-discovery (traversal)
%% process over the global topology, initiated by one of the nodes
[Node | _] = OtherNodes ->
%% Time for a schema merging party!
mnesia_locker:wlock_no_exist(Tid, Store, schema, [Node]),
[mnesia_locker:wlock_no_exist(
Tid, Store, T, mnesia_lib:intersect(Ns, OtherNodes))
|| {T,Ns} <- LockTabs],
%% Fetch from the remote node Node the cstructs of the tables it owns, together with its db_nodes, RemoteRunning1
case fetch_cstructs(Node) of
{cstructs, Cstructs, RemoteRunning1} ->
LockedAlready = Running ++ [Node],
%% After obtaining the cstructs, negotiate via mnesia_recover:connect_nodes with every node of
%% Node's cluster; the negotiation mainly checks both sides' communication protocol versions and
%% whether a partition has previously occurred with those nodes
{New, Old} = mnesia_recover:connect_nodes(RemoteRunning1),
%% New is the protocol-compatible new nodes among RemoteRunning1; Old is the surviving nodes of
%% this node's previous cluster, taken from recover_nodes
RemoteRunning = mnesia_lib:intersect(New ++ Old, RemoteRunning1),
if
%% RemoteRunning = (New ∪ Old) ∩ RemoteRunning1
%% RemoteRunning ≠ RemoteRunning1  <=>  New ∪ (Old ∩ RemoteRunning1) ⊊ RemoteRunning1
%% which means that some nodes of RemoteRunning1 (remote node Node's cluster, i.e. the target
%% cluster of this probe) cannot be connected from this node
RemoteRunning /= RemoteRunning1 ->
mnesia_lib:error("Mnesia on ~p could not connect to node(s) ~p~n",
[node(), RemoteRunning1 -- RemoteRunning]),
mnesia:abort({node_not_running, RemoteRunning1 -- RemoteRunning});
true -> ok
end,
NeedsLock = RemoteRunning -- LockedAlready,
mnesia_locker:wlock_no_exist(Tid, Store, schema, NeedsLock),
[mnesia_locker:wlock_no_exist(Tid, Store, T,
mnesia_lib:intersect(Ns,NeedsLock)) || {T,Ns} <- LockTabs],
NeedsConversion = need_old_cstructs(NeedsLock ++ LockedAlready),
{value, SchemaCs} = lists:keysearch(schema, #cstruct.name, Cstructs),
SchemaDef = cs2list(NeedsConversion, SchemaCs),
%% Announce that Node is running
%% Start the announce_im_running step: announce to the cluster's transactional nodes that this
%% node is joining, and that within this transaction they will merge schemas with it
A = [{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}],
do_insert_schema_ops(Store, A),
%% Introduce remote tables to local node
%% make_merge_schema builds the series of merge_schema ops that merge the schemas; after a
%% successful commit they are applied by mnesia_dumper
do_insert_schema_ops(Store, make_merge_schema(Node, NeedsConversion,
Cstructs)),
%% Introduce local tables to remote nodes
Tabs = val({schema, tables}),
Ops = [{op, merge_schema, get_create_list(T)}
|| T <- Tabs,
not lists:keymember(T, #cstruct.name, Cstructs)],
do_insert_schema_ops(Store, Ops),
%%Ensure that the txn will be committed on all nodes
%% Announce to every node of the other reachable cluster that this node is joining it
NewNodes = RemoteRunning -- Running,
mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}),
announce_im_running(NewNodes, SchemaCs),
{merged, Running, RemoteRunning};
{error, Reason} ->
{"Cannot get cstructs", Node, Reason};
{badrpc, Reason} ->
{"Cannot get cstructs", Node, {badrpc, Reason}}
end;
[] ->
%% No more nodes to merge schema with
not_merged
end.
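The do_merge_schema path above is not only taken at node startup; attaching extra nodes to a running system goes through the same schema merge. A hedged sketch using the documented mnesia:change_config/2 API (join_cluster and Peer are illustrative names):
%% Ask mnesia to connect to Peer and merge schemas with its cluster.
%% {ok, []} means the peer could not be reached; on failure (e.g. incompatible
%% schemas) an error tuple is returned instead of merging.
join_cluster(Peer) ->
    case mnesia:change_config(extra_db_nodes, [Peer]) of
        {ok, []}        -> {error, {not_reachable, Peer}};
        {ok, NewNodes}  -> {ok, NewNodes};
        {error, Reason} -> {error, Reason}
    end.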
announce_im_running([N | Ns], SchemaCs) ->
%% Negotiate with each node of the newly reachable cluster
{L1, L2} = mnesia_recover:connect_nodes([N]),
case lists:member(N, L1) or lists:member(N, L2) of
true ->
%% If the negotiation succeeds, these nodes become transactional nodes of this node. Note that
%% this change takes effect immediately instead of being deferred until the transaction commits
mnesia_lib:add({current, db_nodes}, N),
mnesia_controller:add_active_replica(schema, N, SchemaCs);
false ->
%% If the negotiation fails, the transaction is aborted; the undo action of announce_im_running
%% then strips off all newly added transactional nodes
mnesia_lib:error("Mnesia on ~p could not connect to node ~p~n",
[node(), N]),
mnesia:abort({node_not_running, N})
end,
announce_im_running(Ns, SchemaCs);
announce_im_running([], _) ->
[].
When a schema operation goes through the three-phase commit, mnesia_tm first has to prepare it:
mnesia_tm.erl
multi_commit(asym_trans, Majority, Tid, CR, Store) ->
…
SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})),
…
mnesia_schema.erl
prepare_commit(Tid, Commit, WaitFor) ->
case Commit#commit.schema_ops of
[] ->
{false, Commit, optional};
OrigOps ->
{Modified, Ops, DumperMode} =
prepare_ops(Tid, OrigOps, WaitFor, false, [], optional),
…
end.
prepare_ops(Tid, [Op | Ops], WaitFor, Changed, Acc, DumperMode) ->
case prepare_op(Tid, Op, WaitFor) of
…
{false, optional} ->
prepare_ops(Tid, Ops, WaitFor, true, Acc, DumperMode)
end;
prepare_ops(_Tid, [], _WaitFor, Changed, Acc, DumperMode) ->
{Changed, Acc, DumperMode}.
prepare_op(_Tid, {op, announce_im_running, Node, SchemaDef, Running, RemoteRunning},
_WaitFor) ->
SchemaCs = list2cs(SchemaDef),
if
Node == node() -> %% Announce has already run on local node
ignore; %% from do_merge_schema
true ->
%% If a node has restarted it may still linger in db_nodes,
%% but have been removed from recover_nodes
Current = mnesia_lib:intersect(val({current,db_nodes}), [node()|val(recover_nodes)]),
NewNodes = mnesia_lib:uniq(Running++RemoteRunning) -- Current,
mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}),
announce_im_running(NewNodes, SchemaCs)
end,
{false, optional};
Here we can see that during the prepare step of announce_im_running, the node negotiates with remote nodes it is not yet connected to; once the negotiation succeeds, those previously unconnected nodes join this node's set of transactional nodes.
Conversely, if the schema operation is aborted, mnesia_tm performs the undo actions:
mnesia_tm.erl
commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]),
case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of
{Modified, C = #commit{}, DumperMode} ->
%% If we can not find any local unclear decision
%% we should presume abort at startup recovery
case lists:member(node(), DiscNs) of
false ->
ignore;
true ->
case Modified of
false -> mnesia_log:log(Bin);
true -> mnesia_log:log(C)
end
end,
?eval_debug_fun({?MODULE, commit_participant, vote_yes},
[{tid, Tid}]),
reply(Coord, {vote_yes, Tid, self()}),
receive
{Tid, pre_commit} ->
…
receive
{Tid, committed} ->
…
{Tid, {do_abort, _Reason}} ->