An Overview of mnesia Split-Brain Issues

Contents

1. Symptoms and causes
2. How mnesia works
3. Common problems and caveats
4. Source code analysis
   1. How mnesia:create_schema/1 works
      1. Overall process
      2. First half: what mnesia:create_schema/1 does
      3. Second half: what mnesia:start/0 does
   4. How mnesia:change_table_majority/2 works
      1. Call interface
      2. Transaction operations
      3. schema transaction commit interface
      4. schema transaction protocol
      5. Remote transaction manager: phase-1 prepare response
      6. Remote transaction participant: phase-2 precommit response
      7. Requesting node (transaction initiator): phase-2 precommit acknowledgement
      8. Remote transaction participant: phase-3 commit response
      9. Local commit during the phase-3 commit
   5. majority transaction handling
   6. Recovery
      1. Protocol version check and decision exchange/merge
      2. Node discovery and cluster traversal
      3. Node schema merge
      4. Node data merge, part 1: loading tables from remote nodes
      5. Node data merge, part 2: loading tables from local disc
      6. Node data merge, part 3: table loading completed
   7. Partition detection
      1. Synchronous detection during locking
      2. Synchronous detection during transactions
      3. Asynchronous detection of node down
      4. Asynchronous detection of node up
   8. Miscellaneous

The code analysed in this document is from Erlang/OTP R15B03.

1. Symptoms and causes

Symptom: after a network partition occurs, different data is written into each partition, so the partitions end up in inconsistent states. When the partition heals, mnesia still presents the inconsistent state. If any one partition is then restarted, the restarted side pulls its data from the surviving side, and its own previous data is lost.

Cause: distributed systems are constrained by the CAP theorem (a system that tolerates network partitions cannot provide both availability and consistency at the same time). To preserve availability, some distributed stores give up strong consistency in favour of eventual consistency. mnesia is such an eventually consistent distributed database: while there is no partition, mnesia is strongly consistent; once a partition appears, mnesia keeps accepting writes, so the replicas diverge. After the partition is healed, it is up to the application to resolve the inconsistency. A simple recovery is to restart the partition you are willing to discard so that it re-fetches its data from the partition you keep; a more elaborate recovery requires writing a data-reconciliation program and applying it.
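The "simple recovery" mentioned above (discard one partition and let it reload from the other) can be scripted. A minimal sketch, assuming we have already decided which partition to keep; the function name and the KeeperNode argument are hypothetical and not part of the original text:

    %% Run on each node of the partition being discarded.
    recover_discarded_node(KeeperNode) ->
        stopped = mnesia:stop(),
        %% Optionally mark the kept side as authoritative, so tables are
        %% loaded from it instead of from our own (stale) disc copies.
        ok = mnesia:set_master_nodes([KeeperNode]),
        ok = mnesia:start(),
        %% Wait until the tables have been copied over before serving traffic.
        mnesia:wait_for_tables(mnesia:system_info(tables), 60000).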
2. How mnesia works

mnesia's runtime behaviour can be drawn as a state diagram in which the transaction path uses majority transactions, i.e. writes are allowed only while a majority of the replica nodes are present in the cluster. The mechanism is explained as follows:

1. Transactions: run-time transactions guarantee strong consistency while there is no partition. mnesia supports several transaction types:
   a) dirty writes, no lock and no transaction: one asynchronous phase;
   b) asynchronous transactions with locks: one synchronous lock phase, then a transaction with one synchronous and one asynchronous phase;
   c) synchronous transactions with locks: one synchronous lock phase, then a two-phase synchronous transaction;
   d) majority transactions with locks: one synchronous lock phase, then a transaction with two synchronous phases and one asynchronous phase;
   e) schema transactions with locks: one synchronous lock phase, then a three-phase synchronous transaction; effectively a majority transaction that also carries schema operations.

2. Recovery: restart-time recovery provides eventual consistency when partitions have occurred. On restart mnesia performs the following distributed negotiation:
   a) node discovery;
   b) protocol version negotiation;
   c) schema merge;
   d) merge of transaction decisions:
      i. if the remote node decided abort and the local node decided commit, there is a conflict: {inconsistent_database, bad_decision, Node} is reported and the local decision is changed to abort;
      ii. if the remote node decided commit and the local node decided abort, the local decision stays abort; the remote node will adjust and notify;
      iii. if the remote decision is unclear and the local decision is not, the local decision is taken as the outcome and the remote node adjusts;
      iv. if both decisions are unclear, wait for some other node that knows the outcome to start, and adopt its result;
      v. if every node's decision is unclear, the outcome remains unclear;
      vi. transaction decisions do not by themselves change the actual table data;
   e) merge of table data:
      i. if the local node is a master node for the table, it loads the table from disc;
      ii. if the table is a local table, the local node loads it from disc;
      iii. if a remote replica node is alive, the table data is pulled from that remote node;
      iv. if no remote replica is alive and the local node was the last one to shut down, the local node loads the table from disc;
      v. if no remote replica is alive and the local node was not the last to shut down, it waits for some other remote node to start and load the table, then pulls the data from there; until a remote node has loaded the table, the table is not accessible;
      vi. once a table has been loaded, it is not fetched from a remote node again;
      vii. from the cluster's point of view:
         1. if another node restarts and initiates a new negotiation, this node adds it to its view of the cluster topology;
         2. if a cluster node goes down (shuts down or is partitioned away), this node removes it from its topology view;
         3. when a partition heals, no negotiation takes place: nodes from the other partition are not re-added to the topology view, and the partitions stay separate.

3. Inconsistency detection: at run time and at restart, mnesia watches the up/down history of remote nodes and their transaction decisions to detect whether a network partition has ever happened. If it has, there is a potential inconsistency across partitions, and the application is notified with an inconsistent_database system event:
   a) at run time, if both sides have seen each other down, then when the remote node comes up again {inconsistent_database, running_partitioned_network, Node} is reported;
   b) at restart, if both sides have seen each other down, then when the remote node comes up again {inconsistent_database, starting_partitioned_network, Node} is reported;
   c) at run time and at restart, transaction decisions are exchanged with remote nodes; if the remote node aborted a transaction that the local node committed, {inconsistent_database, bad_decision, Node} is reported.

3. Common problems and caveats

Where the questions below touch on transactions, only majority transactions (without schema operations) are discussed, because they are somewhat more complete than the plain synchronous and asynchronous kinds.

fail_safe state: after a network partition, the state in which the minority partition cannot be written to.

Common problems:

1. After a partition, this node ends up in the minority partition. After the partition heals, if this node is not restarted, does it stay in the fail_safe state forever?
   If other nodes keep starting up, negotiate with this node and join its cluster so that the cluster becomes a majority, the cluster becomes writable again. If no other node ever starts, this node stays in the fail_safe state indefinitely.

2. After a brief network outage, a write is made in the majority partition that cannot reach the minority partition; after the outage, a write is attempted in the minority partition. How does the minority end up in the fail_safe state?
   mnesia relies on the Erlang VM to detect nodes going down. During the write in the majority partition, its VM detects the minority nodes as down, and the minority's VM likewise sees the majority nodes as down. Since both sides have seen each other down, the majority stays writable while the minority enters the fail_safe state.

3. For a cluster A, B, C that is partitioned into A versus B, C: data is written on B and C, then the partition heals. What happens if A is restarted? What if B and C are restarted?
   Experimentally:
   a) if A is restarted, the records written on B and C are correctly visible on A, because A's startup negotiation makes it request the table data from B and C;
   b) if B and C are restarted, the records written earlier are no longer visible on B and C, because their startup negotiation makes them request the table data from A.

Caveats:

1. Under partition, mnesia is eventually consistent, not strongly consistent. To get strong consistency you can designate a master node to arbitrate the final data, but that introduces a single point of failure.
2. By the time you subscribe to mnesia's system events (including inconsistent_database), mnesia is already running, so some events may already have been emitted and can be missed.
3. mnesia event subscriptions are not persistent; after a mnesia restart you must re-subscribe.
4. A majority transaction has two synchronous phases and one asynchronous phase, and its commit requires each participating node to spawn an extra process, on top of the usual ets store and a synchronous lock phase, so it can further reduce performance.
5. majority does not constrain the recovery process, and recovery prefers a live remote replica as the source for the local copy of a table.
6. mnesia's inconsistent_database check and report is a rather conservative condition and can produce false positives.
7. When split brain is detected after the fact, it is usually better to raise an alarm and let a human resolve it than to resolve it automatically.
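To illustrate caveats 2, 3 and 7, here is a minimal sketch (function names are hypothetical, not part of the original text) of a watcher that subscribes to mnesia system events and only raises an alarm on inconsistent_database, leaving the actual repair to a human:

    start_partition_watch() ->
        spawn(fun() ->
                      %% Subscribe right after mnesia:start(); the subscription is
                      %% lost when mnesia restarts and must then be renewed.
                      {ok, _} = mnesia:subscribe(system),
                      watch_loop()
              end).

    watch_loop() ->
        receive
            {mnesia_system_event, {inconsistent_database, Context, Node}} ->
                %% Context is running_partitioned_network,
                %% starting_partitioned_network or bad_decision.
                error_logger:error_msg("Possible mnesia split brain (~p) involving ~p~n",
                                       [Context, Node]),
                watch_loop();
            {mnesia_system_event, _Other} ->
                watch_loop()
        end.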
4. Source code analysis

The topics are:
1. A disc replica of a table requires the schema to have a disc replica as well, so we first look at how mnesia:create_schema/1 works.
2. majority transactions are explained through mnesia:change_table_majority/2; since that operation is itself a schema transaction, it also gives a more detailed and complete picture of majority transactions.
3. majority transaction handling is a weakened form of the schema transaction model and gets its own explanation.
4. The recovery part analyses the main work done when mnesia starts, the distributed negotiation, and the loading of disc tables.
5. The partition-detection part analyses how mnesia detects the various inconsistent_database events.

1. How mnesia:create_schema/1 works

1. Overall process

Installing a schema must be done while mnesia is stopped; mnesia is started afterwards. Adding a schema is essentially a two-phase commit. The node initiating the schema change:
1. asks each participating node whether it already has a schema copy;
2. takes the global lock {mnesia_table_lock, schema};
3. spawns an mnesia_fallback process on each participating node;
4. phase one: broadcasts {start, Header, Schema2} to each node's mnesia_fallback process, telling it to save a backup of the newly generated schema file;
5. phase two: broadcasts swap, telling each mnesia_fallback process to finish the commit and create the real "FALLBACK.BUP" file;
6. finally broadcasts stop to each node's mnesia_fallback process, completing the change.
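Seen from the public API, the whole two-phase procedure above is hidden behind two calls. A minimal usage sketch (node names are hypothetical):

    %% Run while mnesia is stopped on both nodes and neither has a schema yet,
    %% otherwise ensure_no_schema/1 below makes the call fail.
    init_cluster() ->
        Nodes = ['a@host1', 'b@host2'],
        ok = mnesia:create_schema(Nodes),                %% installs FALLBACK.BUP on every node
        [rpc:call(N, mnesia, start, []) || N <- Nodes],  %% start/0 turns it into schema.DAT
        mnesia:wait_for_tables([schema], 5000).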
2. First half: what mnesia:create_schema/1 does

mnesia.erl
create_schema(Ns) ->
    mnesia_bup:create_schema(Ns).

mnesia_bup.erl
create_schema([]) ->
    create_schema([node()]);
create_schema(Ns) when is_list(Ns) ->
    case is_set(Ns) of
        true -> create_schema(Ns, mnesia_schema:ensure_no_schema(Ns));
        false -> {error, {combine_error, Ns}}
    end;
create_schema(Ns) ->
    {error, {badarg, Ns}}.

mnesia_schema.erl
ensure_no_schema([H|T]) when is_atom(H) ->
    case rpc:call(H, ?MODULE, remote_read_schema, []) of
        {badrpc, Reason} ->
            {H, {"All nodes not running", H, Reason}};
        {ok, Source, _} when Source /= default ->
            {H, {already_exists, H}};
        _ ->
            ensure_no_schema(T)
    end;
ensure_no_schema([H|_]) ->
    {error, {badarg, H}};
ensure_no_schema([]) ->
    ok.

remote_read_schema() ->
    case mnesia_lib:ensure_loaded(?APPLICATION) of
        ok ->
            case mnesia_monitor:get_env(schema_location) of
                opt_disc -> read_schema(false);
                _ -> read_schema(false)
            end;
        {error, Reason} ->
            {error, Reason}
    end.

Every other node is asked whether it is running and whether it already has a mnesia schema; the check succeeds only if all nodes that are to get the schema are running and none of them has a schema copy yet.

Back in mnesia_bup.erl:

create_schema(Ns, ok) ->
    case mnesia_lib:ensure_loaded(?APPLICATION) of
        ok ->
            case mnesia_monitor:get_env(schema_location) of
                ram ->
                    {error, {has_no_disc, node()}};
                _ ->
                    case mnesia_schema:opt_create_dir(true, mnesia_lib:dir()) of
                        {error, What} ->
                            {error, What};
                        ok ->
                            Mod = mnesia_backup,
                            Str = mk_str(),
                            File = mnesia_lib:dir(Str),
                            file:delete(File),
                            case catch make_initial_backup(Ns, File, Mod) of
                                {ok, _Res} ->
                                    case do_install_fallback(File, Mod) of
                                        ok ->
                                            file:delete(File),
                                            ok;
                                        {error, Reason} ->
                                            {error, Reason}
                                    end;
                                {error, Reason} ->
                                    {error, Reason}
                            end
                    end
            end;
        {error, Reason} ->
            {error, Reason}
    end;
create_schema(_Ns, {error, Reason}) ->
    {error, Reason};
create_schema(_Ns, Reason) ->
    {error, Reason}.

mnesia_bup:make_initial_backup/3 writes a description file for the new schema on the local node; mnesia_bup:do_install_fallback/2 then applies that description file, via the restore mechanism, to change the schema:

make_initial_backup(Ns, Opaque, Mod) ->
    Orig = mnesia_schema:get_initial_schema(disc_copies, Ns),
    Modded = proplists:delete(storage_properties, proplists:delete(majority, Orig)),
    Schema = [{schema, schema, Modded}],
    O2 = do_apply(Mod, open_write, [Opaque], Opaque),
    O3 = do_apply(Mod, write, [O2, [mnesia_log:backup_log_header()]], O2),
    O4 = do_apply(Mod, write, [O3, Schema], O3),
    O5 = do_apply(Mod, commit_write, [O4], O4),
    {ok, O5}.

This writes the new schema description file on the local node. Note that the majority property of the schema is not included in the backup.

mnesia_schema.erl
get_initial_schema(SchemaStorage, Nodes) ->
    Cs = #cstruct{name = schema,
                  record_name = schema,
                  attributes = [table, cstruct]},
    Cs2 = case SchemaStorage of
              ram_copies -> Cs#cstruct{ram_copies = Nodes};
              disc_copies -> Cs#cstruct{disc_copies = Nodes}
          end,
    cs2list(Cs2).

mnesia_bup.erl
do_install_fallback(Opaque, Mod) when is_atom(Mod) ->
    do_install_fallback(Opaque, [{module, Mod}]);
do_install_fallback(Opaque, Args) when is_list(Args) ->
    case check_fallback_args(Args, #fallback_args{opaque = Opaque}) of
        {ok, FA} -> do_install_fallback(FA);
        {error, Reason} -> {error, Reason}
    end;
do_install_fallback(_Opaque, Args) ->
    {error, {badarg, Args}}.

The installation arguments are checked and packed into a fallback_args record; the checks and construction happen in check_fallback_arg_type/2, after which the installation proper starts.

check_fallback_args([Arg | Tail], FA) ->
    case catch check_fallback_arg_type(Arg, FA) of
        {'EXIT', _Reason} -> {error, {badarg, Arg}};
        FA2 -> check_fallback_args(Tail, FA2)
    end;
check_fallback_args([], FA) ->
    {ok, FA}.

check_fallback_arg_type(Arg, FA) ->
    case Arg of
        {scope, global} -> FA#fallback_args{scope = global};
        {scope, local} -> FA#fallback_args{scope = local};
        {module, Mod} ->
            Mod2 = mnesia_monitor:do_check_type(backup_module, Mod),
            FA#fallback_args{module = Mod2};
        {mnesia_dir, Dir} ->
            FA#fallback_args{mnesia_dir = Dir, use_default_dir = false};
        {keep_tables, Tabs} ->
            atom_list(Tabs),
            FA#fallback_args{keep_tables = Tabs};
        {skip_tables, Tabs} ->
            atom_list(Tabs),
            FA#fallback_args{skip_tables = Tabs};
        {default_op, keep_tables} ->
            FA#fallback_args{default_op = keep_tables};
        {default_op, skip_tables} ->
            FA#fallback_args{default_op = skip_tables}
    end.

Here the construction records the module argument, mnesia_backup, and the opaque argument, the file name of the newly created schema file.

do_install_fallback(FA) ->
    Pid = spawn_link(?MODULE, install_fallback_master, [self(), FA]),
    Res = receive
              {'EXIT', Pid, Reason} -> % if appl has trapped exit
                  {error, {'EXIT', Reason}};
              {Pid, Res2} ->
                  case Res2 of
                      {ok, _} -> ok;
                      {error, Reason} -> {error, {"Cannot install fallback", Reason}}
                  end
          end,
    Res.

install_fallback_master(ClientPid, FA) ->
    process_flag(trap_exit, true),
    State = {start, FA},
    Opaque = FA#fallback_args.opaque,
    Mod = FA#fallback_args.module,
    Res = (catch iterate(Mod, fun restore_recs/4, Opaque, State)),
    unlink(ClientPid),
    ClientPid ! {self(), Res},
    exit(shutdown).

The new schema file is iterated over and restored to the local node and to the whole cluster. Here Mod is mnesia_backup, Opaque is the new schema file name, and State carries the fallback_args, all with default values.

Default definition of fallback_args:

-record(fallback_args, {opaque,
                        scope = global,
                        module = mnesia_monitor:get_env(backup_module),
                        use_default_dir = true,
                        mnesia_dir,
                        fallback_bup,
                        fallback_tmp,
                        skip_tables = [],
                        keep_tables = [],
                        default_op = keep_tables
                       }).

iterate(Mod, Fun, Opaque, Acc) ->
    R = #restore{bup_module = Mod, bup_data = Opaque},
    case catch read_schema_section(R) of
        {error, Reason} ->
            {error, Reason};
        {R2, {Header, Schema, Rest}} ->
            case catch iter(R2, Header, Schema, Fun, Acc, Rest) of
                {ok, R3, Res} ->
                    catch safe_apply(R3, close_read, [R3#restore.bup_data]),
                    {ok, Res};
                {error, Reason} ->
                    catch safe_apply(R2, close_read, [R2#restore.bup_data]),
                    {error, Reason};
                {'EXIT', Pid, Reason} ->
                    catch safe_apply(R2, close_read, [R2#restore.bup_data]),
                    {error, {'EXIT', Pid, Reason}};
                {'EXIT', Reason} ->
                    catch safe_apply(R2, close_read, [R2#restore.bup_data]),
                    {error, {'EXIT', Reason}}
            end
    end.

iter(R, Header, Schema, Fun, Acc, []) ->
    case safe_apply(R, read, [R#restore.bup_data]) of
        {R2, []} ->
            Res = Fun([], Header, Schema, Acc),
            {ok, R2, Res};
        {R2, BupItems} ->
            iter(R2, Header, Schema, Fun, Acc, BupItems)
    end;
iter(R, Header, Schema, Fun, Acc, BupItems) ->
    Acc2 = Fun(BupItems, Header, Schema, Acc),
    iter(R, Header, Schema, Fun, Acc2, []).

read_schema_section reads the contents of the new schema file, extracts the file header, assembles the schema structure, and applies the callback to it; the callback here is mnesia_bup:restore_recs/4:
restore_recs(Recs, Header, Schema, {start, FA}) ->
    %% No records in backup
    Schema2 = convert_schema(Header#log_header.log_version, Schema),
    CreateList = lookup_schema(schema, Schema2),
    case catch mnesia_schema:list2cs(CreateList) of
        {'EXIT', Reason} ->
            throw({error, {"Bad schema in restore_recs", Reason}});
        Cs ->
            Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies),
            global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity),
            Args = [self(), FA],
            Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns],
            send_fallback(Pids, {start, Header, Schema2}),
            Res = restore_recs(Recs, Header, Schema2, Pids),
            global:del_lock({{mnesia_table_lock, schema}, self()}, Ns),
            Res
    end;

A typical schema structure looks like this:

[{schema,schema,
  [{name,schema},
   {type,set},
   {ram_copies,[]},
   {disc_copies,[rds_la_dev@10.232.64.77]},
   {disc_only_copies,[]},
   {load_order,0},
   {access_mode,read_write},
   {index,[]},
   {snmp,[]},
   {local_content,false},
   {record_name,schema},
   {attributes,[table,cstruct]},
   {user_properties,[]},
   {frag_properties,[]},
   {cookie,{{1358,676768,107058},rds_la_dev@10.232.64.77}},
   {version,{{2,0},[]}}]}]

It forms a {schema, schema, CreateList} tuple; mnesia_schema:list2cs(CreateList) converts CreateList back into the schema's cstruct record.

In the restore_recs/4 clause shown above, get_fallback_nodes computes the fallback nodes of the nodes participating in the schema construction, normally all of them. The construction takes the cluster-wide global lock {mnesia_table_lock, schema}. On every participating node a fallback_receiver process is spawned to handle the schema change, {start, Header, Schema2} is broadcast to those fallback_receiver processes, and their replies are awaited. Once every node has acknowledged the start message, the next step follows:

restore_recs([], _Header, _Schema, Pids) ->
    send_fallback(Pids, swap),
    send_fallback(Pids, stop),
    stop;

restore_recs broadcasts the subsequent swap and stop messages to all fallback_receiver processes, completing the whole schema change, and then releases the global lock {mnesia_table_lock, schema}.

Inside the fallback_receiver process:

fallback_receiver(Master, FA) ->
    process_flag(trap_exit, true),
    case catch register(mnesia_fallback, self()) of
        {'EXIT', _} ->
            Reason = {already_exists, node()},
            local_fallback_error(Master, Reason);
        true ->
            FA2 = check_fallback_dir(Master, FA),
            Bup = FA2#fallback_args.fallback_bup,
            case mnesia_lib:exists(Bup) of
                true ->
                    Reason2 = {already_exists, node()},
                    local_fallback_error(Master, Reason2);
                false ->
                    Mod = mnesia_backup,
                    Tmp = FA2#fallback_args.fallback_tmp,
                    R = #restore{mode = replace, bup_module = Mod, bup_data = Tmp},
                    file:delete(Tmp),
                    case catch fallback_receiver_loop(Master, R, FA2, schema) of
                        {error, Reason} ->
                            local_fallback_error(Master, Reason);
                        Other ->
                            exit(Other)
                    end
            end
    end.

It registers itself locally as mnesia_fallback, builds its initial state, and enters fallback_receiver_loop to process the messages coming from the node that initiated the schema change.

fallback_receiver_loop(Master, R, FA, State) ->
    receive
        {Master, {start, Header, Schema}} when State =:= schema ->
            Dir = FA#fallback_args.mnesia_dir,
            throw_bad_res(ok, mnesia_schema:opt_create_dir(true, Dir)),
            R2 = safe_apply(R, open_write, [R#restore.bup_data]),
            R3 = safe_apply(R2, write, [R2#restore.bup_data, [Header]]),
            BupSchema = [schema2bup(S) || S <- Schema],
            R4 = safe_apply(R3, write, [R3#restore.bup_data, BupSchema]),
            Master ! {self(), ok},
            fallback_receiver_loop(Master, R4, FA, records);
        ...
    end.

A temporary schema file is created locally as well, receiving the header and the new schema built by the initiating node.

fallback_receiver_loop(Master, R, FA, State) ->
    receive
        ...
        {Master, swap} when State =/= schema ->
            ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []),
            safe_apply(R, commit_write, [R#restore.bup_data]),
            Bup = FA#fallback_args.fallback_bup,
            Tmp = FA#fallback_args.fallback_tmp,
            throw_bad_res(ok, file:rename(Tmp, Bup)),
            catch mnesia_lib:set(active_fallback, true),
            ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []),
            Master ! {self(), ok},
            fallback_receiver_loop(Master, R, FA, stop);
        ...
    end.

mnesia_backup.erl
commit_write(OpaqueData) ->
    B = OpaqueData,
    case disk_log:sync(B#backup.file_desc) of
        ok ->
            case disk_log:close(B#backup.file_desc) of
                ok ->
                    case file:rename(B#backup.tmp_file, B#backup.file) of
                        ok -> {ok, B#backup.file};
                        {error, Reason} -> {error, Reason}
                    end;
                {error, Reason} -> {error, Reason}
            end;
        {error, Reason} -> {error, Reason}
    end.

This is the commit step. While the new schema file is being written on this node, its name carries a ".BUPTMP" suffix to mark it as an uncommitted temporary file. At commit time the file is synced to disc, closed, and renamed to the real new schema file name, dropping the ".BUPTMP" suffix.

Back in the swap clause of fallback_receiver_loop shown above: on this participating node the new schema file is renamed to "FALLBACK.BUP", and the local active_fallback flag is set, marking the node as an active fallback node.
fallback_receiver_loop(Master, R, FA, State) ->
    receive
        ...
        {Master, stop} when State =:= stop ->
            stopped;
        ...
    end.

On receiving stop, the mnesia_fallback process exits.

3. Second half: what mnesia:start/0 does

Once mnesia starts, the transaction manager mnesia_tm calls mnesia_bup:tm_fallback_start(IgnoreFallback), which builds the schema into a dets table:

mnesia_bup.erl
tm_fallback_start(IgnoreFallback) ->
    mnesia_schema:lock_schema(),
    Res = do_fallback_start(fallback_exists(), IgnoreFallback),
    mnesia_schema:unlock_schema(),
    case Res of
        ok -> ok;
        {error, Reason} -> exit(Reason)
    end.

The schema table is locked, the schema is rebuilt from the "FALLBACK.BUP" file, and the schema lock is released.

do_fallback_start(true, false) ->
    verbose("Starting from fallback...~n", []),
    BupFile = fallback_bup(),
    Mod = mnesia_backup,
    LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]),
    case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of
        ...
    end.

Based on the "FALLBACK.BUP" file, restore_tables/4 is called to do the restore:

restore_tables(Recs, Header, Schema, {start, LocalTabs}) ->
    Dir = mnesia_lib:dir(),
    OldDir = filename:join([Dir, "OLD_DIR"]),
    mnesia_schema:purge_dir(OldDir, []),
    mnesia_schema:purge_dir(Dir, [fallback_name()]),
    init_dat_files(Schema, LocalTabs),
    State = {new, LocalTabs},
    restore_tables(Recs, Header, Schema, State);

init_dat_files(Schema, LocalTabs) ->
    TmpFile = mnesia_lib:tab2tmp(schema),
    Args = [{file, TmpFile}, {keypos, 2}, {type, set}],
    case dets:open_file(schema, Args) of % Assume schema lock
        {ok, _} ->
            create_dat_files(Schema, LocalTabs),
            ok = dets:close(schema),
            LocalTab = #local_tab{name = schema,
                                  storage_type = disc_copies,
                                  open = undefined,
                                  add = undefined,
                                  close = undefined,
                                  swap = undefined,
                                  record_name = schema,
                                  opened = false},
            ?ets_insert(LocalTabs, LocalTab);
        {error, Reason} ->
            throw({error, {"Cannot open file", schema, Args, Reason}})
    end.

The schema dets table is created with the file name schema.TMP, and the metadata of every table is restored from the "FALLBACK.BUP" file into this new dets table. create_dat_files builds the Open/Add/Close/Swap callbacks for the other tables' local metadata, which are then invoked to persist that metadata into the schema table.

The {new, LocalTabs} state of restore_tables opens tables and keeps checking whether each table is local; if so, it switches to the restore/add phase:

restore_tables(All = [Rec | Recs], Header, Schema, {new, LocalTabs}) ->
    Tab = element(1, Rec),
    case ?ets_lookup(LocalTabs, Tab) of
        [] ->
            State = {not_local, LocalTabs, Tab},
            restore_tables(Recs, Header, Schema, State);
        [LT] when is_record(LT, local_tab) ->
            State = {local, LocalTabs, LT},
            case LT#local_tab.opened of
                true -> ignore;
                false ->
                    (LT#local_tab.open)(Tab, LT),
                    ?ets_insert(LocalTabs, LT#local_tab{opened = true})
            end,
            restore_tables(All, Header, Schema, State)
    end;

restore_tables(All = [Rec | Recs], Header, Schema, State = {local, LocalTabs, LT}) ->
    Tab = element(1, Rec),
    if
        Tab =:= LT#local_tab.name ->
            Key = element(2, Rec),
            (LT#local_tab.add)(Tab, Key, Rec, LT),
            restore_tables(Recs, Header, Schema, State);
        true ->
            NewState = {new, LocalTabs},
            restore_tables(All, Header, Schema, NewState)
    end;

The Add callback mainly records the table into the schema table; at this point everything is written into the temporary schema, not yet really committed. Once all tables have been restored, the real commit takes place:

do_fallback_start(true, false) ->
    verbose("Starting from fallback...~n", []),
    BupFile = fallback_bup(),
    Mod = mnesia_backup,
    LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]),
    case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of
        {ok, _Res} ->
            catch dets:close(schema),
            TmpSchema = mnesia_lib:tab2tmp(schema),
            DatSchema = mnesia_lib:tab2dat(schema),
            AllLT = ?ets_match_object(LocalTabs, '_'),
            ?ets_delete_table(LocalTabs),
            case file:rename(TmpSchema, DatSchema) of
                ok ->
                    [(LT#local_tab.swap)(LT#local_tab.name, LT) ||
                        LT <- AllLT, LT#local_tab.name =/= schema],
                    file:delete(BupFile),
                    ok;
                {error, Reason} ->
                    file:delete(TmpSchema),
                    {error, {"Cannot start from fallback. Rename error.", Reason}}
            end;
        {error, Reason} ->
            {error, {"Cannot start from fallback", Reason}};
        {'EXIT', Reason} ->
            {error, {"Cannot start from fallback", Reason}}
    end.

schema.TMP is renamed to schema.DAT, the persistent schema goes live, and the schema table changes are committed. The Swap callbacks built in create_dat_files are then invoked to commit each table: ram_copies tables need no extra action; for disc_only_copies tables the main step is committing the file name of the corresponding dets table; for disc_copies tables the main steps are writing the redo log and then committing the file name of the corresponding dets table. Once all of this is done, the schema table is a persistent dets table and the "FALLBACK.BUP" file is deleted.

After building the schema dets table, the transaction manager initialises mnesia_schema:

mnesia_schema.erl
init(IgnoreFallback) ->
    Res = read_schema(true, IgnoreFallback),
    {ok, Source, _CreateList} = exit_on_error(Res),
    verbose("Schema initiated from: ~p~n", [Source]),
    set({schema, tables}, []),
    set({schema, local_tables}, []),
    Tabs = set_schema(?ets_first(schema)),
    lists:foreach(fun(Tab) -> clear_whereabouts(Tab) end, Tabs),
    set({schema, where_to_read}, node()),
    set({schema, load_node}, node()),
    set({schema, load_reason}, initial),
    mnesia_controller:add_active_replica(schema, node()).

It determines where the schema table was recovered from, initialises the schema's basic information in the global state ets table mnesia_gvar, and registers the local node as the initial active replica of the schema table. If a node is an active replica of a table, the table's where_to_commit and where_to_write properties must both include that node.
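After this startup sequence one can verify from the shell that the node really ended up with a disc-resident schema. A small sketch using standard introspection calls; the expected values are assumptions based on the description above, not taken from the original text:

    check_schema() ->
        opt_disc = mnesia:system_info(schema_location),   %% disc-backed schema when a mnesia dir exists
        true     = mnesia:system_info(use_dir),           %% schema.DAT etc. live in the mnesia directory
        mnesia:table_info(schema, disc_copies).           %% nodes holding a disc copy of the schema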
4. How mnesia:change_table_majority/2 works

A mnesia table can be given a majority property when it is created, or later through mnesia:change_table_majority/2. With this property set, mnesia checks at transaction time that the nodes taking part in the transaction form a majority of the table's replica nodes. When a network partition occurs, this keeps the majority side available while preserving consistency across the whole network; the minority side becomes unavailable. This is one of the trade-offs permitted by the CAP theorem.

1. Call interface

mnesia.erl
change_table_majority(T, M) ->
    mnesia_schema:change_table_majority(T, M).

mnesia_schema.erl
change_table_majority(Tab, Majority) when is_boolean(Majority) ->
    schema_transaction(fun() -> do_change_table_majority(Tab, Majority) end).

schema_transaction(Fun) ->
    case get(mnesia_activity_state) of
        undefined ->
            Args = [self(), Fun, whereis(mnesia_controller)],
            Pid = spawn_link(?MODULE, schema_coordinator, Args),
            receive
                {transaction_done, Res, Pid} -> Res;
                {'EXIT', Pid, R} -> {aborted, {transaction_crashed, R}}
            end;
        _ ->
            {aborted, nested_transaction}
    end.

A schema_coordinator process is spawned to coordinate the schema transaction.

schema_coordinator(Client, Fun, Controller) when is_pid(Controller) ->
    link(Controller),
    unlink(Client),
    Res = mnesia:transaction(Fun),
    Client ! {transaction_done, Res, self()},
    unlink(Controller),          % Avoids spurious exit message
    unlink(whereis(mnesia_tm)),  % Avoids spurious exit message
    exit(normal).

Unlike an ordinary transaction, the schema_coordinator used by a schema transaction is linked to the mnesia_controller process rather than to the caller. It then runs an ordinary mnesia transaction whose body is fun() -> do_change_table_majority(Tab, Majority) end.
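From the application side the majority property is just a table option; the schema transaction analysed in the rest of this section is what runs underneath. A minimal sketch (the account table and its record are hypothetical):

    -record(account, {id, balance}).

    setup() ->
        {atomic, ok} = mnesia:create_table(account,
                           [{attributes, record_info(fields, account)},
                            {disc_copies, [node() | nodes()]}]),
        %% Turn the majority requirement on afterwards.
        {atomic, ok} = mnesia:change_table_majority(account, true).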
2. Transaction operations

do_change_table_majority(schema, _Majority) ->
    mnesia:abort({bad_type, schema});
do_change_table_majority(Tab, Majority) ->
    TidTs = get_tid_ts_and_lock(schema, write),
    get_tid_ts_and_lock(Tab, none),
    insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).

As the first clause shows, the majority property of the schema table itself cannot be changed. A write lock is explicitly requested on the schema table, while no lock is requested on the table whose majority property is being changed.

get_tid_ts_and_lock(Tab, Intent) ->
    TidTs = get(mnesia_activity_state),
    case TidTs of
        {_Mod, Tid, Ts} when is_record(Ts, tidstore) ->
            Store = Ts#tidstore.store,
            case Intent of
                read -> mnesia_locker:rlock_table(Tid, Store, Tab);
                write -> mnesia_locker:wlock_table(Tid, Store, Tab);
                none -> ignore
            end,
            TidTs;
        _ ->
            mnesia:abort(no_transaction)
    end.

Locking goes straight to the lock manager mnesia_locker as a table lock request. Now the actual modification of the majority property, i.e. the call to make_change_table_majority in do_change_table_majority above:

make_change_table_majority(Tab, Majority) ->
    ensure_writable(schema),
    Cs = incr_version(val({Tab, cstruct})),
    ensure_active(Cs),
    OldMajority = Cs#cstruct.majority,
    Cs2 = Cs#cstruct{majority = Majority},
    FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of
                  {_, Tab} ->
                      FragNames = mnesia_frag:frag_names(Tab) -- [Tab],
                      lists:map(
                        fun(T) ->
                                get_tid_ts_and_lock(Tab, none),
                                CsT = incr_version(val({T, cstruct})),
                                ensure_active(CsT),
                                CsT2 = CsT#cstruct{majority = Majority},
                                verify_cstruct(CsT2),
                                {op, change_table_majority, vsn_cs2list(CsT2),
                                 OldMajority, Majority}
                        end, FragNames);
                  false ->
                      [];
                  {_, _} ->
                      mnesia:abort({bad_type, Tab})
              end,
    verify_cstruct(Cs2),
    [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].

ensure_writable checks whether the schema table's where_to_write property is [], i.e. whether any node holds a persistent schema. incr_version bumps the table's version. ensure_active checks that all of the table's replica nodes are alive, i.e. it confirms the table's global view with the replica nodes.

Updating the table's metadata version:

incr_version(Cs) ->
    {{Major, Minor}, _} = Cs#cstruct.version,
    Nodes = mnesia_lib:intersect(val({schema, disc_copies}),
                                 mnesia_lib:cs_to_nodes(Cs)),
    V = case Nodes -- val({Cs#cstruct.name, active_replicas}) of
            [] -> {Major + 1, 0};    % All replicas are active
            _ -> {Major, Minor + 1}  % Some replicas are inactive
        end,
    Cs#cstruct{version = {V, {node(), now()}}}.

mnesia_lib.erl
cs_to_nodes(Cs) ->
    Cs#cstruct.disc_only_copies ++ Cs#cstruct.disc_copies ++ Cs#cstruct.ram_copies.

Because this is a change to the schema table, the new version number is computed from the nodes holding a persistent schema and the nodes holding replicas of the table: if the intersection of these two sets is entirely alive, the major version can be increased, otherwise only the minor version is incremented. A new version descriptor is generated for the table's cstruct with the shape {new version number, {node initiating the change, time of the change}} — effectively a space-time marker plus a monotonically increasing sequence. The version calculation is similar to NDB's.

Checking the table's global view:

ensure_active(Cs) ->
    ensure_active(Cs, active_replicas).

ensure_active(Cs, What) ->
    Tab = Cs#cstruct.name,
    W = {Tab, What},
    ensure_non_empty(W),
    Nodes = mnesia_lib:intersect(val({schema, disc_copies}),
                                 mnesia_lib:cs_to_nodes(Cs)),
    case Nodes -- val(W) of
        [] ->
            ok;
        Ns ->
            Expl = "All replicas on diskfull nodes are not active yet",
            case val({Tab, local_content}) of
                true ->
                    case rpc:multicall(Ns, ?MODULE, is_remote_member, [W]) of
                        {Replies, []} ->
                            check_active(Replies, Expl, Tab);
                        {_Replies, BadNs} ->
                            mnesia:abort({not_active, Expl, Tab, BadNs})
                    end;
                false ->
                    mnesia:abort({not_active, Expl, Tab, Ns})
            end
    end.

is_remote_member(Key) ->
    IsActive = lists:member(node(), val(Key)),
    {IsActive, node()}.

To avoid an inconsistent view, any "unclear" node — one that is not an active replica of the table, yet is a replica node of the table and also a persistent schema node — must be asked, via is_remote_member, whether it already considers itself an active replica of the table. This prevents such a node and the requesting node from disagreeing about the state being changed.

Back in make_change_table_majority shown above: the majority field of the table's cstruct is updated and the new cstruct is verified, the verification checking mainly the types and contents of the cstruct fields. vsn_cs2list converts the cstruct into a proplist whose keys are the record field names and whose values are the field values. The result is a change_table_majority operation handed to insert_schema_ops (see do_change_table_majority above); at this point the generated operation list is [{op, change_table_majority, proplist of the table's new cstruct, OldMajority, Majority}].

insert_schema_ops({_Mod, _Tid, Ts}, SchemaIOps) ->
    do_insert_schema_ops(Ts#tidstore.store, SchemaIOps).

do_insert_schema_ops(Store, [Head | Tail]) ->
    ?ets_insert(Store, Head),
    do_insert_schema_ops(Store, Tail);
do_insert_schema_ops(_Store, []) ->
    ok.

As can be seen, the insertion step merely records the make_change_table_majority operations in the transaction's temporary ets store. Once this is done, mnesia proceeds to the commit. Unlike a plain table transaction, the operation key starts with op, marking this as a schema transaction, so the transaction manager treats it specially and uses a different commit protocol.
3. schema transaction commit interface

mnesia_tm.erl
t_commit(Type) ->
    {_Mod, Tid, Ts} = get(mnesia_activity_state),
    Store = Ts#tidstore.store,
    if
        Ts#tidstore.level == 1 ->
            intercept_friends(Tid, Ts),
            case arrange(Tid, Store, Type) of
                {N, Prep} when N > 0 ->
                    multi_commit(Prep#prep.protocol, majority_attr(Prep),
                                 Tid, Prep#prep.records, Store);
                {0, Prep} ->
                    multi_commit(read_only, majority_attr(Prep),
                                 Tid, Prep#prep.records, Store)
            end;
        true -> %% nested commit
            Level = Ts#tidstore.level,
            [{OldMod, Obsolete} | Tail] = Ts#tidstore.up_stores,
            req({del_store, Tid, Store, Obsolete, false}),
            NewTs = Ts#tidstore{store = Store, up_stores = Tail, level = Level - 1},
            NewTidTs = {OldMod, Tid, NewTs},
            put(mnesia_activity_state, NewTidTs),
            do_commit_nested
    end.

The check happens first when the operations are arranged:

arrange(Tid, Store, Type) ->
    %% The local node is always included
    Nodes = get_elements(nodes, Store),
    Recs = prep_recs(Nodes, []),
    Key = ?ets_first(Store),
    N = 0,
    Prep = case Type of
               async -> #prep{protocol = sym_trans, records = Recs};
               sync -> #prep{protocol = sync_sym_trans, records = Recs}
           end,
    case catch do_arrange(Tid, Store, Key, Prep, N) of
        {'EXIT', Reason} ->
            dbg_out("do_arrange failed ~p ~p~n", [Reason, Tid]),
            case Reason of
                {aborted, R} -> mnesia:abort(R);
                _ -> mnesia:abort(Reason)
            end;
        {New, Prepared} ->
            {New, Prepared#prep{records = reverse(Prepared#prep.records)}}
    end.

The Key argument is the first operation inserted into the temporary ets store, which here is op.

do_arrange(Tid, Store, {Tab, Key}, Prep, N) ->
    Oid = {Tab, Key},
    Items = ?ets_lookup(Store, Oid), %% Store is a bag
    P2 = prepare_items(Tid, Tab, Key, Items, Prep),
    do_arrange(Tid, Store, ?ets_next(Store, Oid), P2, N + 1);
do_arrange(Tid, Store, SchemaKey, Prep, N) when SchemaKey == op ->
    Items = ?ets_lookup(Store, SchemaKey), %% Store is a bag
    P2 = prepare_schema_items(Tid, Items, Prep),
    do_arrange(Tid, Store, ?ets_next(Store, SchemaKey), P2, N + 1);

Ordinary table operations are keyed by {Tab, Key}, while schema table operations are keyed by op; the Items obtained are [{op, change_table_majority, proplist of the table's new cstruct, OldMajority, Majority}]. This makes the transaction use a different commit protocol:

prepare_schema_items(Tid, Items, Prep) ->
    Types = [{N, schema_ops} || N <- val({current, db_nodes})],
    Recs = prepare_nodes(Tid, Types, Items, Prep#prep.records, schema),
    Prep#prep{protocol = asym_trans, records = Recs}.

prepare_node records the schema table operations in the schema_ops field of Recs and sets the commit protocol to asym_trans.

prepare_node(_Node, _Storage, Items, Rec, Kind)
  when Kind == schema, Rec#commit.schema_ops == [] ->
    Rec#commit{schema_ops = Items};

Back in t_commit shown above: the commit therefore runs with asym_trans, the protocol used mainly for schema operations, operations on tables with the majority attribute, the recover_coordinator path, and restore_op operations.
4. schema transaction protocol

multi_commit(asym_trans, Majority, Tid, CR, Store) ->
    D = #decision{tid = Tid, outcome = presume_abort},
    {D2, CR2} = commit_decision(D, CR, [], []),
    DiscNs = D2#decision.disc_nodes,
    RamNs = D2#decision.ram_nodes,
    case have_majority(Majority, DiscNs ++ RamNs) of
        ok -> ok;
        {error, Tab} -> mnesia:abort({no_majority, Tab})
    end,
    Pending = mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs),
    ?ets_insert(Store, Pending),
    {WaitFor, Local} = ask_commit(asym_trans, Tid, CR2, DiscNs, RamNs),
    SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})),
    {Votes, Pids} = rec_all(WaitFor, Tid, do_commit, []),
    ?eval_debug_fun({?MODULE, multi_commit_asym_got_votes},
                    [{tid, Tid}, {votes, Votes}]),
    case Votes of
        do_commit ->
            case SchemaPrep of
                {_Modified, C = #commit{}, DumperMode} ->
                    mnesia_log:log(C), % C is not a binary
                    ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_rec},
                                    [{tid, Tid}]),
                    D3 = C#commit.decision,
                    D4 = D3#decision{outcome = unclear},
                    mnesia_recover:log_decision(D4),
                    ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_dec},
                                    [{tid, Tid}]),
                    tell_participants(Pids, {Tid, pre_commit}),
                    rec_acc_pre_commit(Pids, Tid, Store, {C, Local},
                                       do_commit, DumperMode, [], []);
                {'EXIT', Reason} ->
                    mnesia_recover:note_decision(Tid, aborted),
                    ?eval_debug_fun({?MODULE, multi_commit_asym_prepare_exit},
                                    [{tid, Tid}]),
                    tell_participants(Pids, {Tid, {do_abort, Reason}}),
                    do_abort(Tid, Local),
                    {do_abort, Reason}
            end;
        {do_abort, Reason} ->
            mnesia_recover:note_decision(Tid, aborted),
            ?eval_debug_fun({?MODULE, multi_commit_asym_do_abort}, [{tid, Tid}]),
            tell_participants(Pids, {Tid, {do_abort, Reason}}),
            do_abort(Tid, Local),
            {do_abort, Reason}
    end.

The transaction enters through mnesia_tm:t_commit/1 and proceeds as follows:

1. The initiating node checks the majority condition: the number of live replica nodes of the table must be strictly greater than half of the table's disc plus ram replica nodes; exactly half is not enough.
2. The initiating node calls mnesia_checkpoint:tm_enter_pending, producing a checkpoint entry.
3. The initiating node starts the first commit phase by sending ask_commit to the transaction manager on each participating node; note that the protocol type is asym_trans.
4. Each participating transaction manager spawns a commit_participant process that handles the rest of the commit on that node. Note that majority and schema table operations therefore need an extra helper process per commit, which may reduce performance.
5. The participant's commit_participant process runs the local prepare step for the schema operations; for change_table_majority there is nothing to prepare.
6. The commit_participant process agrees to commit and replies vote_yes to the initiator.
7. The initiator collects the consent of all participants.
8. The initiator runs its own local prepare step for the schema operations; again, for change_table_majority nothing needs to be prepared.
9. Having received vote_yes from all participants, the initiator logs the operations to be committed.
10. The initiator writes the phase-1 recovery record: presume_abort.
11. The initiator writes the phase-2 recovery record: unclear.
12. The initiator starts the second phase by sending pre_commit to each participant's commit_participant process.
13. Each commit_participant receives pre_commit and performs the pre-commit.
14. The participant writes the phase-1 recovery record: presume_abort.
15. The participant writes the phase-2 recovery record: unclear.
16. The commit_participant acknowledges the pre-commit by replying acc_pre_commit to the initiator.
17. Once the initiator has received acc_pre_commit from all participants, it records which participants it must wait for for the schema operations, for use in crash recovery.
18. The initiator starts the third phase by sending committed to each participant's commit_participant process.
19. a. Right after telling the participants to commit, the initiator writes the phase-2 recovery record committed.
    b. Each commit_participant, on receiving committed, writes the phase-2 recovery record committed.
20. a. After writing that record, the initiator commits locally via do_commit.
    b. After writing that record, each commit_participant commits locally via do_commit.
21. a. After its local commit, if there are schema operations, the initiator waits synchronously for the schema commit results from the participants' commit_participant processes.
    b. After its local commit, if there are schema operations, each commit_participant replies schema_commit to the initiator.
22. a. Once the initiator has received schema_commit from all participants, it releases the locks and the transaction resources.
    b. Each commit_participant releases its locks and transaction resources.

5. Remote transaction manager: phase-1 prepare response

When a participating node's transaction manager receives the phase-1 commit message:

mnesia_tm.erl
doit_loop(#state{coordinators = Coordinators, participants = Participants,
                 supervisor = Sup} = State) ->
    ...
    {From, {ask_commit, Protocol, Tid, Commit, DiscNs, RamNs}} ->
        ?eval_debug_fun({?MODULE, doit_ask_commit}, [{tid, Tid}, {prot, Protocol}]),
        mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs),
        Pid = case Protocol of
                  asym_trans when node(Tid#tid.pid) /= node() ->
                      Args = [tmpid(From), Tid, Commit, DiscNs, RamNs],
                      spawn_link(?MODULE, commit_participant, Args);
                  _ when node(Tid#tid.pid) /= node() -> %% *_sym_trans
                      reply(From, {vote_yes, Tid}),
                      nopid
              end,
        P = #participant{tid = Tid,
                         pid = Pid,
                         commit = Commit,
                         disc_nodes = DiscNs,
                         ram_nodes = RamNs,
                         protocol = Protocol},
        State2 = State#state{participants = gb_trees:insert(Tid, P, Participants)},
        doit_loop(State2);
    ...
A commit_participant process is spawned with the arguments [coordinator pid, transaction id, commit record, disc node list, ram node list] to carry the commit through:

commit_participant(Coord, Tid, Bin, DiscNs, RamNs) when is_binary(Bin) ->
    process_flag(trap_exit, true),
    Commit = binary_to_term(Bin),
    commit_participant(Coord, Tid, Bin, Commit, DiscNs, RamNs);
commit_participant(Coord, Tid, C = #commit{}, DiscNs, RamNs) ->
    process_flag(trap_exit, true),
    commit_participant(Coord, Tid, C, C, DiscNs, RamNs).

commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
    ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]),
    case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of
        {Modified, C = #commit{}, DumperMode} ->
            case lists:member(node(), DiscNs) of
                false -> ignore;
                true ->
                    case Modified of
                        false -> mnesia_log:log(Bin);
                        true -> mnesia_log:log(C)
                    end
            end,
            ?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]),
            reply(Coord, {vote_yes, Tid, self()}),
            ...

Early on, the participant's commit_participant process has to run the local prepare step for the schema table:

mnesia_schema.erl
prepare_commit(Tid, Commit, WaitFor) ->
    case Commit#commit.schema_ops of
        [] ->
            {false, Commit, optional};
        OrigOps ->
            {Modified, Ops, DumperMode} =
                prepare_ops(Tid, OrigOps, WaitFor, false, [], optional),
            InitBy = schema_prepare,
            GoodRes = {Modified,
                       Commit#commit{schema_ops = lists:reverse(Ops)},
                       DumperMode},
            case DumperMode of
                optional ->
                    dbg_out("Transaction log dump skipped (~p): ~w~n",
                            [DumperMode, InitBy]);
                mandatory ->
                    case mnesia_controller:sync_dump_log(InitBy) of
                        dumped -> GoodRes;
                        {error, Reason} -> mnesia:abort(Reason)
                    end
            end,
            case Ops of
                [] -> ignore;
                _ -> mnesia_controller:wait_for_schema_commit_lock()
            end,
            GoodRes
    end.

There are three main branches here:
1. If the operations contain no schema operations at all, nothing is done and {false, the original Commit, optional} is returned; this is the case for operations on majority tables.
2. If prepare_ops finds operations such as rec, announce_im_running, sync_trans, create_table, delete_table, add_table_copy, del_table_copy, change_table_copy_type, dump_table, add_snmp, transform or merge_schema, a prepare step may (but does not always) have to be performed; it consists of per-operation bookkeeping plus syncing the log.
3. If prepare_ops finds only other kinds of operations, nothing is done and {true, the original Commit, optional} is returned; this covers the lighter schema operations, and change_table_majority falls into this category.
6. Remote transaction participant: phase-2 precommit response

commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
    ...
    receive
        {Tid, pre_commit} ->
            D = C#commit.decision,
            mnesia_recover:log_decision(D#decision{outcome = unclear}),
            ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]),
            Expect_schema_ack = C#commit.schema_ops /= [],
            reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}),
            receive
                {Tid, committed} ->
                    mnesia_recover:log_decision(D#decision{outcome = committed}),
                    ?eval_debug_fun({?MODULE, commit_participant, log_commit},
                                    [{tid, Tid}]),
                    do_commit(Tid, C, DumperMode),
                    case Expect_schema_ack of
                        false -> ignore;
                        true -> reply(Coord, {schema_commit, Tid, self()})
                    end,
                    ?eval_debug_fun({?MODULE, commit_participant, do_commit},
                                    [{tid, Tid}]);
                ...
            end;
    ...

On receiving the pre-commit message, the participant's commit_participant process likewise writes the phase-2 recovery record unclear and replies acc_pre_commit.

7. Requesting node (transaction initiator): phase-2 precommit acknowledgement

Once the initiator has received acc_pre_commit from all participants:

rec_acc_pre_commit([], Tid, Store, {Commit, OrigC}, Res, DumperMode,
                   GoodPids, SchemaAckPids) ->
    D = Commit#commit.decision,
    case Res of
        do_commit ->
            prepare_sync_schema_commit(Store, SchemaAckPids),
            tell_participants(GoodPids, {Tid, committed}),
            D2 = D#decision{outcome = committed},
            mnesia_recover:log_decision(D2),
            ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_commit}, [{tid, Tid}]),
            do_commit(Tid, Commit, DumperMode),
            ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_commit}, [{tid, Tid}]),
            sync_schema_commit(Tid, Store, SchemaAckPids),
            mnesia_locker:release_tid(Tid),
            ?MODULE ! {delete_transaction, Tid};
        {do_abort, Reason} ->
            tell_participants(GoodPids, {Tid, {do_abort, Reason}}),
            D2 = D#decision{outcome = aborted},
            mnesia_recover:log_decision(D2),
            ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_abort}, [{tid, Tid}]),
            do_abort(Tid, OrigC),
            ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_abort}, [{tid, Tid}])
    end,
    Res.

prepare_sync_schema_commit(_Store, []) ->
    ok;
prepare_sync_schema_commit(Store, [Pid | Pids]) ->
    ?ets_insert(Store, {waiting_for_commit_ack, node(Pid)}),
    prepare_sync_schema_commit(Store, Pids).

The initiator records locally which nodes take part in the schema operation, for crash recovery, then sends committed to each participant's commit_participant process so that they perform the final commit. The initiator can now commit locally: it writes the phase-2 recovery record committed and commits via do_commit, then waits synchronously for the participants' schema commit results. If there are no schema operations it can return immediately; here it has to wait:

sync_schema_commit(_Tid, _Store, []) ->
    ok;
sync_schema_commit(Tid, Store, [Pid | Tail]) ->
    receive
        {?MODULE, _, {schema_commit, Tid, Pid}} ->
            ?ets_match_delete(Store, {waiting_for_commit_ack, node(Pid)}),
            sync_schema_commit(Tid, Store, Tail);
        {mnesia_down, Node} when Node == node(Pid) ->
            ?ets_match_delete(Store, {waiting_for_commit_ack, Node}),
            sync_schema_commit(Tid, Store, Tail)
    end.
8. Remote transaction participant: phase-3 commit response

When the participant's commit_participant process receives the commit message (the committed branch of the receive block shown in section 6 above), it writes the phase-2 recovery record committed and commits locally via do_commit; then, if there are schema operations, it replies schema_commit to the initiator, otherwise the transaction is finished.

9. Local commit during the phase-3 commit

do_commit(Tid, C, DumperMode) ->
    mnesia_dumper:update(Tid, C#commit.schema_ops, DumperMode),
    R = do_snmp(Tid, C#commit.snmp),
    R2 = do_update(Tid, ram_copies, C#commit.ram_copies, R),
    R3 = do_update(Tid, disc_copies, C#commit.disc_copies, R2),
    R4 = do_update(Tid, disc_only_copies, C#commit.disc_only_copies, R3),
    mnesia_subscr:report_activity(Tid),
    R4.

We only look at the update of the schema table here; note that these updates happen both on the initiating node and on the participating nodes. The schema table update consists of:
1. in the mnesia global-variable ets table mnesia_gvar, recording the table's majority property with the new value and updating the table's where_to_wlock property;
2. in mnesia_gvar, recording the table's cstruct and the properties derived from it;
3. recording the table's cstruct in the schema ets table;
4. recording the table's cstruct in the schema dets table.

The update proceeds as follows:

mnesia_dumper.erl
update(_Tid, [], _DumperMode) ->
    dumped;
update(Tid, SchemaOps, DumperMode) ->
    UseDir = mnesia_monitor:use_dir(),
    Res = perform_update(Tid, SchemaOps, DumperMode, UseDir),
    mnesia_controller:release_schema_commit_lock(),
    Res.

perform_update(_Tid, _SchemaOps, mandatory, true) ->
    InitBy = schema_update,
    ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]),
    opt_dump_log(InitBy);
perform_update(Tid, SchemaOps, _DumperMode, _UseDir) ->
    InitBy = fast_schema_update,
    InPlace = mnesia_monitor:get_env(dump_log_update_in_place),
    ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]),
    case catch insert_ops(Tid, schema_ops, SchemaOps, InPlace, InitBy,
                          mnesia_log:version()) of
        {'EXIT', Reason} ->
            Error = {error, {"Schema update error", Reason}},
            close_files(InPlace, Error, InitBy),
            fatal("Schema update error ~p ~p", [Reason, SchemaOps]);
        _ ->
            ?eval_debug_fun({?MODULE, post_dump}, [InitBy]),
            close_files(InPlace, ok, InitBy),
            ok
    end.

insert_ops(_Tid, _Storage, [], _InPlace, _InitBy, _) ->
    ok;
insert_ops(Tid, Storage, [Op], InPlace, InitBy, Ver) when Ver >= "4.3" ->
    insert_op(Tid, Storage, Op, InPlace, InitBy),
    ok;
insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver >= "4.3" ->
    insert_op(Tid, Storage, Op, InPlace, InitBy),
    insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver);
insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver < "4.3" ->
    insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver),
    insert_op(Tid, Storage, Op, InPlace, InitBy).

...
insert_op(Tid, _, {op, change_table_majority, TabDef, _OldAccess, _Access},
          InPlace, InitBy) ->
    Cs = mnesia_schema:list2cs(TabDef),
    case InitBy of
        startup -> ignore;
        _ -> mnesia_controller:change_table_majority(Cs)
    end,
    insert_cstruct(Tid, Cs, true, InPlace, InitBy);
...

For a change_table_majority operation, whose shape is {op, change_table_majority, proplist of the table's new cstruct, OldMajority, Majority}, the proplist form of the cstruct is converted back into its record form and the real update is performed:

mnesia_controller.erl
change_table_majority(Cs) ->
    W = fun() ->
                Tab = Cs#cstruct.name,
                set({Tab, majority}, Cs#cstruct.majority),
                update_where_to_wlock(Tab)
        end,
    update(W).
update_where_to_wlock(Tab) ->
    WNodes = val({Tab, where_to_write}),
    Majority = case catch val({Tab, majority}) of
                   true -> true;
                   _ -> false
               end,
    set({Tab, where_to_wlock}, {WNodes, Majority}).

This part of the update records the table's majority property with its new value in the mnesia global-variable ets table mnesia_gvar and rewrites the majority part of the table's where_to_wlock property.

Back in insert_op for change_table_majority (shown above), insert_cstruct is then called: besides updating the table's where_to_wlock property in mnesia_gvar, the cstruct itself and the properties derived from it must be updated as well, and so must the table's cstruct recorded in the schema ets table:

insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) ->
    Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts),
    {schema, Tab, _} = Val,
    S = val({schema, storage_type}),
    disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy),
    Tab.

mnesia_schema.erl
insert_cstruct(Tid, Cs, KeepWhereabouts) ->
    Tab = Cs#cstruct.name,
    TabDef = cs2list(Cs),
    Val = {schema, Tab, TabDef},
    mnesia_checkpoint:tm_retain(Tid, schema, Tab, write),
    mnesia_subscr:report_table_event(schema, Tid, Val, write),
    Active = val({Tab, active_replicas}),
    case KeepWhereabouts of
        true -> ignore;
        false when Active == [] -> clear_whereabouts(Tab);
        false -> ignore
    end,
    set({Tab, cstruct}, Cs),
    ?ets_insert(schema, Val),
    do_set_schema(Tab, Cs),
    Val.

do_set_schema(Tab) ->
    List = get_create_list(Tab),
    Cs = list2cs(List),
    do_set_schema(Tab, Cs).

do_set_schema(Tab, Cs) ->
    Type = Cs#cstruct.type,
    set({Tab, setorbag}, Type),
    set({Tab, local_content}, Cs#cstruct.local_content),
    set({Tab, ram_copies}, Cs#cstruct.ram_copies),
    set({Tab, disc_copies}, Cs#cstruct.disc_copies),
    set({Tab, disc_only_copies}, Cs#cstruct.disc_only_copies),
    set({Tab, load_order}, Cs#cstruct.load_order),
    set({Tab, access_mode}, Cs#cstruct.access_mode),
    set({Tab, majority}, Cs#cstruct.majority),
    set({Tab, all_nodes}, mnesia_lib:cs_to_nodes(Cs)),
    set({Tab, snmp}, Cs#cstruct.snmp),
    set({Tab, user_properties}, Cs#cstruct.user_properties),
    [set({Tab, user_property, element(1, P)}, P) || P <- Cs#cstruct.user_properties],
    set({Tab, frag_properties}, Cs#cstruct.frag_properties),
    mnesia_frag:set_frag_hash(Tab, Cs#cstruct.frag_properties),
    set({Tab, storage_properties}, Cs#cstruct.storage_properties),
    set({Tab, attributes}, Cs#cstruct.attributes),
    Arity = length(Cs#cstruct.attributes) + 1,
    set({Tab, arity}, Arity),
    RecName = Cs#cstruct.record_name,
    set({Tab, record_name}, RecName),
    set({Tab, record_validation}, {RecName, Arity, Type}),
    set({Tab, wild_pattern}, wild(RecName, Arity)),
    set({Tab, index}, Cs#cstruct.index), %% create actual index tabs later
    set({Tab, cookie}, Cs#cstruct.cookie),
    set({Tab, version}, Cs#cstruct.version),
    set({Tab, cstruct}, Cs),
    Storage = mnesia_lib:schema_cs_to_storage_type(node(), Cs),
    set({Tab, storage_type}, Storage),
    mnesia_lib:add({schema, tables}, Tab),
    Ns = mnesia_lib:cs_to_nodes(Cs),
    case lists:member(node(), Ns) of
        true -> mnesia_lib:add({schema, local_tables}, Tab);
        false when Tab == schema -> mnesia_lib:add({schema, local_tables}, Tab);
        false -> ignore
    end.

do_set_schema updates the properties derived from the cstruct, such as the version, the cookie and so on.

mnesia_dumper.erl
In insert_cstruct/5 shown above, the final step is disc_insert:

disc_insert(_Tid, Storage, Tab, Key, Val, Op, InPlace, InitBy) ->
    case open_files(Tab, Storage, InPlace, InitBy) of
        true ->
            case Storage of
                disc_copies when Tab /= schema ->
                    mnesia_log:append({?MODULE, Tab}, {{Tab, Key}, Val, Op}),
                    ok;
                _ ->
                    dets_insert(Op, Tab, Key, Val)
            end;
        false ->
            ignore
    end.

dets_insert(Op, Tab, Key, Val) ->
    case Op of
        write ->
            dets_updated(Tab, Key),
            ok = dets:insert(Tab, Val);
        ...
    end.

dets_updated(Tab, Key) ->
    case get(mnesia_dumper_dets) of
        undefined ->
            Empty = gb_trees:empty(),
            Tree = gb_trees:insert(Tab, gb_sets:singleton(Key), Empty),
            put(mnesia_dumper_dets, Tree);
        Tree ->
            case gb_trees:lookup(Tab, Tree) of
                {value, cleared} ->
                    ignore;
                {value, Set} ->
                    T = gb_trees:update(Tab, gb_sets:add(Key, Set), Tree),
                    put(mnesia_dumper_dets, T);
                none ->
                    T = gb_trees:insert(Tab, gb_sets:singleton(Key), Tree),
                    put(mnesia_dumper_dets, T)
            end
    end.
This updates the table's cstruct recorded in the schema dets table.

In summary, changes to the schema table, and transactions on majority tables, commit in three phases, with sound crash-recovery bookkeeping along the way. A schema table change updates several places:
1. in the mnesia global-variable ets table mnesia_gvar, the changed property of the table is set to its new value;
2. in mnesia_gvar, the table's cstruct and the properties derived from it are recorded;
3. the table's cstruct is recorded in the schema ets table;
4. the table's cstruct is recorded in the schema dets table.

5. majority transaction handling

A majority transaction follows the same overall path as a schema transaction, except that during the commit in mnesia_tm:multi_commit it does not call mnesia_schema:prepare_commit/3 or mnesia_tm:prepare_sync_schema_commit/2 to modify the schema table, nor does it call mnesia_tm:sync_schema_commit to wait for the synchronous completion of the third phase.
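From the application's point of view the visible effect of the majority check is simply that writes abort while the node sits in a minority partition. A minimal sketch, reusing the hypothetical account table from the earlier example; the abort reason follows from the mnesia:abort({no_majority, Tab}) call in multi_commit shown above:

    deposit(Id, Amount) ->
        F = fun() ->
                    case mnesia:read(account, Id, write) of
                        [A = #account{balance = B}] ->
                            mnesia:write(A#account{balance = B + Amount});
                        [] ->
                            mnesia:write(#account{id = Id, balance = Amount})
                    end
            end,
        case mnesia:transaction(F) of
            {atomic, ok} ->
                ok;
            {aborted, {no_majority, account}} ->
                %% We are on the minority side (fail_safe state); report it
                %% instead of retrying blindly.
                {error, minority_partition};
            {aborted, Reason} ->
                {error, Reason}
        end.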
  46. 6. 恢复mnesia 的连接协商过程用于在启动时,结点间交互状态信息:整个协商包括如下过程:1. 节点发现,集群遍历2. 节点协议版本检查3. 节点 schema 合并4. 节点 decision 通告与合并5. 节点数据重新载入与合并1. 节点协议版本检查+节点 decision 通告与合并mnesia_recover.erlconnect_nodes(Ns) -> %%Ns 为要检查的节点 call({connect_nodes, Ns}).handle_call({connect_nodes, Ns}, From, State) -> %% Determine which nodes we should try to connect AlreadyConnected = val(recover_nodes), {_, Nodes} = mnesia_lib:search_delete(node(), Ns), Check = Nodes -- AlreadyConnected, %%开始版本协商 case mnesia_monitor:negotiate_protocol(Check) of busy -> %% monitor is disconnecting some nodes retry %% the req (to avoid deadlock). erlang:send_after(2, self(), {connect_nodes,Ns,From}), {noreply, State}; [] -> %% No good noodes to connect to! %% We cant use reply here because this function can be
  47. %% called from handle_info gen_server:reply(From, {[], AlreadyConnected}), {noreply, State}; GoodNodes -> %% GoodNodes 是协商通过的节点 %% Now we have agreed upon a protocol with some new nodes %% and we may use them when we recover transactions mnesia_lib:add_list(recover_nodes, GoodNodes), %%协议版本协商通过后,告知这些节点本节点曾经的历史事务 decision cast({announce_all, GoodNodes}), case get_master_nodes(schema) of [] -> Context = starting_partitioned_network, %%检查曾经是否与这些节点出现过分区 mnesia_monitor:detect_inconcistency(GoodNodes, Context); _ -> %% If master_nodes is set ignore old inconsistencies ignore end, gen_server:reply(From, {GoodNodes, AlreadyConnected}), {noreply,State} end;handle_cast({announce_all, Nodes}, State) -> announce_all(Nodes), {noreply, State};announce_all([]) -> ok;announce_all(ToNodes) -> Tid = trans_tid_serial(), announce(ToNodes, [{trans_tid,serial,Tid}], [], false).announce(ToNodes, [Head | Tail], Acc, ForceSend) -> Acc2 = arrange(ToNodes, Head, Acc, ForceSend), announce(ToNodes, Tail, Acc2, ForceSend);announce(_ToNodes, [], Acc, _ForceSend) -> send_decisions(Acc).send_decisions([{Node, Decisions} | Tail]) -> %%注意此处,decision 合并过程是一个异步过程 abcast([Node], {decisions, node(), Decisions}), send_decisions(Tail);send_decisions([]) ->
  48. ok.遍历所有协商通过的节点,告知其本节点的历史事务 decision下列流程位于远程节点中,远程节点将被称为接收节点,而本节点将称为发送节点handle_cast({decisions, Node, Decisions}, State) -> mnesia_lib:add(recover_nodes, Node), State2 = add_remote_decisions(Node, Decisions, State), {noreply, State2};接收节点的 mnesia_monitor 在收到这些广播来的 decision 后,进行比较合并。decision 有多种类型,用于事务提交的为 decision 结构和 transient_decision 结构add_remote_decisions(Node, [D | Tail], State) when is_record(D, decision) -> State2 = add_remote_decision(Node, D, State), add_remote_decisions(Node, Tail, State2);add_remote_decisions(Node, [C | Tail], State) when is_record(C, transient_decision) -> D = #decision{tid = C#transient_decision.tid, outcome = C#transient_decision.outcome, disc_nodes = [], ram_nodes = []}, State2 = add_remote_decision(Node, D, State), add_remote_decisions(Node, Tail, State2);add_remote_decisions(Node, [{mnesia_down, _, _, _} | Tail], State) -> add_remote_decisions(Node, Tail, State);add_remote_decisions(Node, [{trans_tid, serial, Serial} | Tail], State) -> %%对于发送节点传来的未决事务,接收节点需要继续询问其它节点 sync_trans_tid_serial(Serial), case State#state.unclear_decision of undefined -> ignored; D -> case lists:member(Node, D#decision.ram_nodes) of true -> ignore; false -> %%若未决事务 decision 的发送节点不是内存副本节点,则接收节点将向其询问该未决事务的真正结果 abcast([Node], {what_decision, node(), D}) end
  49. end, add_remote_decisions(Node, Tail, State);add_remote_decisions(_Node, [], State) -> State.add_remote_decision(Node, NewD, State) -> Tid = NewD#decision.tid, OldD = decision(Tid), %%根据合并策略进行 decision 合并,对于唯一的冲突情况,即接收节点提交事务,而发送节点中止事务,则接收节点处也选择中止事务,而事务本身的状态将由检查点和 redo日志进行重构 D = merge_decisions(Node, OldD, NewD), %%记录合并结果 do_log_decision(D, false, undefined), Outcome = D#decision.outcome, if OldD == no_decision -> ignore; Outcome == unclear -> ignore; true -> case lists:member(node(), NewD#decision.disc_nodes) or lists:member(node(), NewD#decision.ram_nodes) of true -> %%向其它节点告知本节点的 decision 合并结果 tell_im_certain([Node], D); false -> ignore end end, case State#state.unclear_decision of U when U#decision.tid == Tid -> WaitFor = State#state.unclear_waitfor -- [Node], if Outcome == unclear, WaitFor == [] -> %% Everybody are uncertain, lets abort %%询问过未决事务的所有参与节点后,仍然没有任何节点可以提供事务提交结果,此时决定终止事务 NewOutcome = aborted, CertainD = D#decision{outcome = NewOutcome,
  50. disc_nodes = [], ram_nodes = []}, tell_im_certain(D#decision.disc_nodes, CertainD), tell_im_certain(D#decision.ram_nodes, CertainD), do_log_decision(CertainD, false, undefined), verbose("Decided to abort transaction ~p " "since everybody are uncertain ~p~n", [Tid, CertainD]), gen_server:reply(State#state.unclear_pid, {ok, NewOutcome}), State#state{unclear_pid = undefined, unclear_decision = undefined, unclear_waitfor = undefined}; Outcome /= unclear -> %%发送节点知道事务结果,通告事务结果 verbose("~p told us that transaction ~p was ~p~n", [Node, Tid, Outcome]), gen_server:reply(State#state.unclear_pid, {ok, Outcome}), State#state{unclear_pid = undefined, unclear_decision = undefined, unclear_waitfor = undefined}; Outcome == unclear -> %%发送节点也不知道事务结果,此时继续等待 State#state{unclear_waitfor = WaitFor} end; _ -> State end.合并策略:merge_decisions(Node, D, NewD0) -> NewD = filter_aborted(NewD0), if D == no_decision, node() /= Node -> %% We did not know anything about this txn NewD#decision{disc_nodes = []}; D == no_decision -> NewD; is_record(D, decision) -> DiscNs = D#decision.disc_nodes -- ([node(), Node]), OldD = filter_aborted(D#decision{disc_nodes = DiscNs}), if
  51. OldD#decision.outcome == unclear, NewD#decision.outcome == unclear -> D; OldD#decision.outcome == NewD#decision.outcome -> %% We have come to the same decision OldD; OldD#decision.outcome == committed, NewD#decision.outcome == aborted -> %%decision 发送节点与接收节点唯一冲突的位置,即接收节点提交事务,而发送节点中止事务,此时仍然选择中止事务 Msg = {inconsistent_database, bad_decision, Node}, mnesia_lib:report_system_event(Msg), OldD#decision{outcome = aborted}; OldD#decision.outcome == aborted -> OldD#decision{outcome = aborted}; NewD#decision.outcome == aborted -> OldD#decision{outcome = aborted}; OldD#decision.outcome == committed, NewD#decision.outcome == unclear -> OldD#decision{outcome = committed}; OldD#decision.outcome == unclear, NewD#decision.outcome == committed -> OldD#decision{outcome = committed} end end.2. 节点发现,集群遍历mnesia_controller.erlmerge_schema() -> AllNodes = mnesia_lib:all_nodes(), %%尝试合并 schema,合并完了后通知所有曾经的集群节点,与本节点进行数据转移 case try_merge_schema(AllNodes, [node()], fun default_merge/1) of ok -> %%合并 schema 成功后,将进行数据合并 schema_is_merged(); {aborted, {throw, Str}} when is_list(Str) -> fatal("Failed to merge schema: ~s~n", [Str]); Else -> fatal("Failed to merge schema: ~p~n", [Else]) end.
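From the application's point of view, the discovery and merge walked through here is triggered either by the db_nodes configuration at startup or, at runtime, by the documented mnesia:change_config/2 call. A minimal sketch (node names are placeholders, and the error handling is only an example):

join_cluster(Nodes) ->
    %% Connecting extra db nodes at runtime triggers the protocol
    %% negotiation, schema merge and table transfer described in this
    %% section. Returns the nodes that could actually be reached.
    case mnesia:change_config(extra_db_nodes, Nodes) of
        {ok, []}        -> {error, no_node_reached};
        {ok, Reached}   -> {ok, Reached};
        {error, Reason} -> {error, Reason}
    end.

%% Example: join_cluster(['b@host2', 'c@host3']).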
  52. try_merge_schema(Nodes, Told0, UserFun) -> %%开始集群遍历,启动一个 schema 合并事务 case mnesia_schema:merge_schema(UserFun) of {atomic, not_merged} -> %% No more nodes that we need to merge the schema with %% Ensure we have told everybody that we are running case val({current,db_nodes}) -- mnesia_lib:uniq(Told0) of [] -> ok; Tell -> im_running(Tell, [node()]), ok end; {atomic, {merged, OldFriends, NewFriends}} -> %% Check if new nodes has been added to the schema Diff = mnesia_lib:all_nodes() -- [node() | Nodes], mnesia_recover:connect_nodes(Diff), %% Tell everybody to adopt orphan tables %%通知所有的集群节点,本节点启动,开始数据合并申请 im_running(OldFriends, NewFriends), im_running(NewFriends, OldFriends), Told = case lists:member(node(), NewFriends) of true -> Told0 ++ OldFriends; false -> Told0 ++ NewFriends end, try_merge_schema(Nodes, Told, UserFun); {atomic, {"Cannot get cstructs", Node, Reason}} -> dbg_out("Cannot get cstructs, Node ~p ~p~n", [Node, Reason]), timer:sleep(300), % Avoid a endless loop look alike try_merge_schema(Nodes, Told0, UserFun); {aborted, {shutdown, _}} -> %% One of the nodes is going down timer:sleep(300), % Avoid a endless loop look alike try_merge_schema(Nodes, Told0, UserFun); Other -> Other end.mnesia_schema.erlmerge_schema() -> schema_transaction(fun() -> do_merge_schema([]) end).merge_schema(UserFun) -> schema_transaction(fun() -> UserFun(fun(Arg) -> do_merge_schema(Arg) end) end).可以看出 merge_schema 的过程也是放在一个 mnesia 元数据事务中进行的,这个事务的主
  53. 题操作包括:{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}{op, merge_schema, CstructList}这个过程会与集群中的事务节点进行 schema 协商,检查 schema 是否兼容。do_merge_schema(LockTabs0) -> %% 锁 schema 表 {_Mod, Tid, Ts} = get_tid_ts_and_lock(schema, write), LockTabs = [{T, tab_to_nodes(T)} || T <- LockTabs0], [get_tid_ts_and_lock(T,write) || {T,_} <- LockTabs], Connected = val(recover_nodes), Running = val({current, db_nodes}), Store = Ts#tidstore.store, %% Verify that all nodes are locked that might not be the %% case, if this trans where queued when new nodes where added. case Running -- ets:lookup_element(Store, nodes, 2) of [] -> ok; %% All known nodes are locked Miss -> %% Abort! We dont want the sideeffects below to be executed mnesia:abort({bad_commit, {missing_lock, Miss}}) end, %% Connected 是本节点的已连接节点,通常为当前集群中通信协议兼容的结点; Running是本节点的当前 db_nodes,通常为当前集群中与本节点一致的结点; case Connected -- Running of %% 对于那些已连接,但是还未进行 decision 的节点,需要进行通信协议协商,然后进行 decision 协商,这个过程实质上是一个全局拓扑下的节点发现过程(遍历算法) ,这个过程由某个节点发起, [Node | _] = OtherNodes -> %% Time for a schema merging party! mnesia_locker:wlock_no_exist(Tid, Store, schema, [Node]), [mnesia_locker:wlock_no_exist( Tid, Store, T, mnesia_lib:intersect(Ns, OtherNodes)) || {T,Ns} <- LockTabs], %% 从远程结点 Node 处取得其拥有的表的 cstruct,及其 db_nodes RemoteRunning1 case fetch_cstructs(Node) of {cstructs, Cstructs, RemoteRunning1} ->
  54. LockedAlready = Running ++ [Node], %% 取得 cstruct 后,通过 mnesia_recover:connect_nodes,与远程节点 Node的集群中的每一个节点进行协商,协商主要包括检查双方的通信协议版本,并检查之前与这些结点是否曾有过分区 {New, Old} = mnesia_recover:connect_nodes(RemoteRunning1), %% New 为 RemoteRunning1 中版本兼容的新结点, 为本节点原先的集群存 Old活结点,来自于 recover_nodes RemoteRunning = mnesia_lib:intersect(New ++ Old, RemoteRunning1), If %% RemoteRunning = (New∪Old)∩RemoteRunning1 %% RemoteRunning≠RemoteRunning <=> %% New∪(Old∩RemoteRunning1) < RemoteRunning1 %%意味着 RemoteRunning1(远程节点 Node 的集群,也即此次探查的目标集群)中有部分节点不能与本节点相连 RemoteRunning /= RemoteRunning1 -> mnesia_lib:error("Mnesia on ~p could not connect to node(s) ~p~n", [node(), RemoteRunning1 -- RemoteRunning]), mnesia:abort({node_not_running, RemoteRunning1 -- RemoteRunning}); true -> ok end, NeedsLock = RemoteRunning -- LockedAlready, mnesia_locker:wlock_no_exist(Tid, Store, schema, NeedsLock), [mnesia_locker:wlock_no_exist(Tid, Store, T,mnesia_lib:intersect(Ns,NeedsLock)) || {T,Ns} <- LockTabs], NeedsConversion = need_old_cstructs(NeedsLock ++ LockedAlready), {value, SchemaCs} = lists:keysearch(schema, #cstruct.name, Cstructs), SchemaDef = cs2list(NeedsConversion, SchemaCs), %% Announce that Node is running %%开始 announce_im_running 的过程,向集群的事务事务通告本节点进入集群,同时告知本节点,集群事务节点在这个事务中会与本节点进行 schema 合并 A = [{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}],
  55. do_insert_schema_ops(Store, A), %% Introduce remote tables to local node %%make_merge_schema 构造一系列合并 schema 的 merge_schema 操作,在提交成功后由 mnesia_dumper 执行生效 do_insert_schema_ops(Store, make_merge_schema(Node, NeedsConversion,Cstructs)), %% Introduce local tables to remote nodes Tabs = val({schema, tables}), Ops = [{op, merge_schema, get_create_list(T)} || T <- Tabs, not lists:keymember(T, #cstruct.name, Cstructs)], do_insert_schema_ops(Store, Ops), %%Ensure that the txn will be committed on all nodes %%向另一个可连接集群中的所有节点通告本节点正在加入集群 NewNodes = RemoteRunning -- Running, mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}), announce_im_running(NewNodes, SchemaCs), {merged, Running, RemoteRunning}; {error, Reason} -> {"Cannot get cstructs", Node, Reason}; {badrpc, Reason} -> {"Cannot get cstructs", Node, {badrpc, Reason}} end; [] -> %% No more nodes to merge schema with not_merged end.announce_im_running([N | Ns], SchemaCs) -> %%与新的可连接集群的节点经过协商 {L1, L2} = mnesia_recover:connect_nodes([N]), case lists:member(N, L1) or lists:member(N, L2) of true -> %%若协商通过,则这些节点就可以作为本节点的事务节点了,注意此处,这个修改是立即生效的,而不会延迟到事务提交 mnesia_lib:add({current, db_nodes}, N), mnesia_controller:add_active_replica(schema, N, SchemaCs);
  56. false -> %%若协商未通过,则中止事务,此时会通过 announce_im_running 的 undo 动作,将新加入的事务节点全部剥离 mnesia_lib:error("Mnesia on ~p could not connect to node ~p~n", [node(), N]), mnesia:abort({node_not_running, N}) end, announce_im_running(Ns, SchemaCs);announce_im_running([], _) -> [].schema 操作在三阶段提交时,mnesia_tm 首先要进行 prepare:mnesia_tm.erlmulti_commit(asym_trans, Majority, Tid, CR, Store) ->… SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})),…mnesia_schema.erlprepare_commit(Tid, Commit, WaitFor) -> case Commit#commit.schema_ops of [] -> {false, Commit, optional}; OrigOps -> {Modified, Ops, DumperMode} = prepare_ops(Tid, OrigOps, WaitFor, false, [], optional), … end.prepare_ops(Tid, [Op | Ops], WaitFor, Changed, Acc, DumperMode) -> case prepare_op(Tid, Op, WaitFor) of … {false, optional} -> prepare_ops(Tid, Ops, WaitFor, true, Acc, DumperMode) end;prepare_ops(_Tid, [], _WaitFor, Changed, Acc, DumperMode) -> {Changed, Acc, DumperMode}.prepare_op(_Tid, {op, announce_im_running, Node, SchemaDef, Running, RemoteRunning},_WaitFor) -> SchemaCs = list2cs(SchemaDef), if Node == node() -> %% Announce has already run on local node
  57. ignore; %% from do_merge_schema true -> %% If a node has restarted it may still linger in db_nodes, %% but have been removed from recover_nodes Current = mnesia_lib:intersect(val({current,db_nodes}), [node()|val(recover_nodes)]), NewNodes = mnesia_lib:uniq(Running++RemoteRunning) -- Current, mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}), announce_im_running(NewNodes, SchemaCs) end, {false, optional};此处可以看出,在 announce_im_running 的 prepare 过程中,要与远程未连接的节点进行协商,协商通过后,这些未连接节点将加入本节点的事务节点集群反之,一旦该 schema 操作中止,mnesia_tm 将进行 undo 动作:mnesia_tm.erlcommit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) -> ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]), case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of {Modified, C = #commit{}, DumperMode} -> %% If we can not find any local unclear decision %% we should presume abort at startup recovery case lists:member(node(), DiscNs) of false -> ignore; true -> case Modified of false -> mnesia_log:log(Bin); true -> mnesia_log:log(C) end end, ?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]), reply(Coord, {vote_yes, Tid, self()}), receive {Tid, pre_commit} -> … receive {Tid, committed} -> … {Tid, {do_abort, _Reason}} ->
  58. … mnesia_schema:undo_prepare_commit(Tid, C0), … {EXIT, _, _} -> … mnesia_schema:undo_prepare_commit(Tid, C0), … end; {Tid, {do_abort, Reason}} -> … mnesia_schema:undo_prepare_commit(Tid, C0), … {EXIT, _, Reason} -> … mnesia_schema:undo_prepare_commit(Tid, C0), … end; {EXIT, Reason} -> … mnesia_schema:undo_prepare_commit(Tid, C0) end, ….mnesia_schema.erlundo_prepare_commit(Tid, Commit) -> case Commit#commit.schema_ops of [] -> ignore; Ops -> %% Catch to allow failure mnesia_controller may not be started catch mnesia_controller:release_schema_commit_lock(), undo_prepare_ops(Tid, Ops) end, Commit.undo_prepare_ops(Tid, [Op | Ops]) -> case element(1, Op) of TheOp when TheOp /= op, TheOp /= restore_op -> undo_prepare_ops(Tid, Ops); _ -> undo_prepare_ops(Tid, Ops), undo_prepare_op(Tid, Op) end;undo_prepare_ops(_Tid, []) -> [].undo_prepare_op(_Tid, {op, announce_im_running, _Node, _, _Running, _RemoteRunning}) ->
  59. case ?catch_val(prepare_op) of {announce_im_running, New} -> unannounce_im_running(New); _Else -> ok end;unannounce_im_running([N | Ns]) -> mnesia_lib:del({current, db_nodes}, N), mnesia_controller:del_active_replica(schema, N), unannounce_im_running(Ns);unannounce_im_running([]) -> ok.由此可见集群发现与合并事务节点加入:mnesia_controller.erladd_active_replica(Tab, Node, Storage, AccessMode) -> Var = {Tab, where_to_commit}, {Blocked, Old} = is_tab_blocked(val(Var)), Del = lists:keydelete(Node, 1, Old), case AccessMode of read_write -> New = lists:sort([{Node, Storage} | Del]), set(Var, mark_blocked_tab(Blocked, New)), % where_to_commit mnesia_lib:add_lsort({Tab, where_to_write}, Node); read_only -> set(Var, mark_blocked_tab(Blocked, Del)), mnesia_lib:del({Tab, where_to_write}, Node) end, update_where_to_wlock(Tab), add({Tab, active_replicas}, Node).事务节点删除:mnesia_controller.erldel_active_replica(Tab, Node) -> Var = {Tab, where_to_commit}, {Blocked, Old} = is_tab_blocked(val(Var)), Del = lists:keydelete(Node, 1, Old), New = lists:sort(Del), set(Var, mark_blocked_tab(Blocked, New)), % where_to_commit mnesia_lib:del({Tab, active_replicas}, Node),
    mnesia_lib:del({Tab, where_to_write}, Node),
    update_where_to_wlock(Tab).

3. Schema merging between nodes

The construction of the schema merge operations is lengthy, so only a summary is given here (a sketch of these compatibility rules follows the dumper code below).

Merging the schema table:
1. If the schema table cookies of the local and remote node differ, and the two sides have different master nodes or neither has a master, the schemas cannot be merged;
2. If the storage types of the schema table differ between the two nodes and both sides are disc_copies, the schemas cannot be merged;

Merging ordinary tables:
1. If the cookies of an ordinary table differ between the local and remote node, and the two sides have different master nodes or neither has a master, the table cannot be merged;

When a merge is possible, the table's cstruct, storage_type and version are merged:
1. When merging storage_type, disc_copies is preferred over ram_copies and disc_only_copies is preferred over disc_copies, while ram_copies and disc_only_copies are incompatible;
2. When merging version, the principal attributes of the table definition must be identical, and the larger major/minor version number is chosen;

In most cases the schema therefore merges without trouble. The actual write of the merged schema happens in the commit phase, in do_commit:

mnesia_dumper.erl

insert_op(Tid, _, {op, merge_schema, TabDef}, InPlace, InitBy) ->
    Cs = mnesia_schema:list2cs(TabDef),
    case Cs#cstruct.name of
        schema ->
            Update = fun(NS = {Node,Storage}) ->
  61. case mnesia_lib:cs_to_storage_type(Node, Cs) of Storage -> NS; disc_copies when Node == node() -> Dir = mnesia_lib:dir(), ok = mnesia_schema:opt_create_dir(true, Dir), mnesia_schema:purge_dir(Dir, []), mnesia_log:purge_all_logs(), mnesia_lib:set(use_dir, true), mnesia_log:init(), Ns = val({current, db_nodes}), F = fun(U) -> mnesia_recover:log_mnesia_up(U) end, lists:foreach(F, Ns), raw_named_dump_table(schema, dat), temp_set_master_nodes(), {Node,disc_copies}; CSstorage -> {Node,CSstorage} end end, W2C0 = val({schema, where_to_commit}), W2C = case W2C0 of {blocked, List} -> {blocked,lists:map(Update,List)}; List -> lists:map(Update,List) end, if W2C == W2C0 -> ignore; true -> mnesia_lib:set({schema, where_to_commit}, W2C) end; _ -> ignore end, insert_cstruct(Tid, Cs, false, InPlace, InitBy);insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) -> Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts), {schema, Tab, _} = Val, S = val({schema, storage_type}), disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy), Tab.分别在 ets 表 mnesia_gvar、内存 schema 表、磁盘 schema 表中记录新的表 cstruct 及其相关信息。
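The compatibility rules above can be illustrated with a small standalone sketch. The module below is hypothetical and not part of mnesia; it only mirrors the cookie/master and storage-type rules summarized in this subsection, using a plain map instead of the real #cstruct{} record.

-module(merge_rules_sketch).
-export([can_merge/2, merge_storage/2]).

%% Two table definitions with different cookies can only be merged
%% when both sides agree on a non-empty set of master nodes.
can_merge(#{cookie := C1, masters := M1}, #{cookie := C2, masters := M2})
  when C1 =/= C2 ->
    M1 =/= [] andalso M1 =:= M2;
can_merge(_, _) ->
    true.

%% Storage-type preference as described above:
%% disc_copies beats ram_copies, disc_only_copies beats disc_copies,
%% ram_copies and disc_only_copies are incompatible.
merge_storage(S, S) -> {ok, S};
merge_storage(disc_copies, ram_copies) -> {ok, disc_copies};
merge_storage(ram_copies, disc_copies) -> {ok, disc_copies};
merge_storage(disc_only_copies, disc_copies) -> {ok, disc_only_copies};
merge_storage(disc_copies, disc_only_copies) -> {ok, disc_only_copies};
merge_storage(ram_copies, disc_only_copies) -> {error, incompatible};
merge_storage(disc_only_copies, ram_copies) -> {error, incompatible}.

For instance, merge_storage(disc_copies, ram_copies) yields {ok, disc_copies}, while merge_storage(ram_copies, disc_only_copies) reports the incompatible combination.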
4. Data merging between nodes, part 1: loading tables from remote nodes

In merge_schema -> try_merge_schema, mnesia_controller calls im_running twice: once to tell the already running nodes about this node and the new nodes it has discovered, and once to tell the new nodes (including this node itself) about this node and the other nodes in its cluster. This links the old cluster and the new cluster together.

When a running node receives an im_running message, it immediately sends an adopt_orphans message to the new nodes (including the node that issued the im_running request), asking them to adopt its tables. Since this node broadcasts im_running, several adopt_orphans messages will also be delivered back to it, and it fetches table data from the sender of the first adopt_orphans message that arrives. In the common case, where only this node is starting, the procedure reduces to: this node announces im_running to all running nodes, and the first running node to answer transfers its table data to this node.

mnesia_controller.erl

im_running(OldFriends, NewFriends) ->
    abcast(OldFriends, {im_running, node(), NewFriends}).

handle_cast({im_running, Node, NewFriends}, State) ->
    LocalTabs = mnesia_lib:local_active_tables() -- [schema],
    RemoveLocalOnly = fun(Tab) -> not val({Tab, local_content}) end,
    Tabs = lists:filter(RemoveLocalOnly, LocalTabs),
    Nodes = mnesia_lib:union([Node],val({current, db_nodes})),
    Ns = mnesia_lib:intersect(NewFriends, Nodes),
    %% Tell the remote nodes to start exchanging data with this node
    abcast(Ns, {adopt_orphans, node(), Tabs}),
    noreply(State);

This node asks the remote nodes to exchange data with it. Note that the decision merging described earlier is going on concurrently. The remote nodes answer with adopt_orphans messages.
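While this handshake is in progress, the current whereabouts of a table can be inspected with documented mnesia calls; this is often the quickest way to see which replica a rejoined node will read from. The table name my_tab is a placeholder:

check_whereabouts() ->
    %% Read-only probes: where a table is currently read from and
    %% written to, and which nodes are considered part of the cluster.
    [{db_nodes,         mnesia:system_info(db_nodes)},
     {running_db_nodes, mnesia:system_info(running_db_nodes)},
     {where_to_read,    mnesia:table_info(my_tab, where_to_read)},
     {where_to_write,   mnesia:table_info(my_tab, where_to_write)},
     {master_nodes,     mnesia:table_info(my_tab, master_nodes)}].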
  63. 本节点收到一个远程节点发送的 adopt_orphans 消息后,将开始从这个节点处取得数据:handle_cast({adopt_orphans, Node, Tabs}, State) -> %%node_has_tabs 将抢先一步从当前存活的节点处拉取数据,若这一步能取得表数据,则之后不会再从本地磁盘中取数据,为了保持数据一致性,最好选择设置 master 节点,全局数据与 master 保持一致。 %%本节点将远程节点加入表的活动副本中,并开始异步取得数据 State2 = node_has_tabs(Tabs, Node, State), case ?catch_val({node_up,Node}) of true -> ignore; _ -> %% Register the other node as up and running %%标识远程节点 up,并产生 mnesia_up 事件 set({node_up, Node}, true), mnesia_recover:log_mnesia_up(Node), verbose("Logging mnesia_up ~w~n",[Node]), mnesia_lib:report_system_event({mnesia_up, Node}), %% Load orphan tables LocalTabs = val({schema, local_tables}) -- [schema], Nodes = val({current, db_nodes}), %%若未设置 master,则 RemoteMasters 为[],若无 local 表,则 LocalOrphans 为[]。 %%若有 local 表,则从磁盘加载这些 local 表 {LocalOrphans, RemoteMasters} = orphan_tables(LocalTabs, Node, Nodes, [], []), Reason = {adopt_orphan, node()}, mnesia_late_loader:async_late_disc_load(node(), LocalOrphans, Reason), Fun = fun(N) -> RemoteOrphans = [Tab || {Tab, Ns} <- RemoteMasters, lists:member(N, Ns)], mnesia_late_loader:maybe_async_late_disc_load(N, RemoteOrphans, Reason) end, lists:foreach(Fun, Nodes) end, noreply(State2);
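The comment above recommends configuring master nodes so that a restarted node resynchronizes from a well-defined source instead of whichever adopt_orphans message arrives first. A minimal sketch using the documented mnesia:set_master_nodes API (node and table names are examples only):

set_masters() ->
    %% Use one designated node as master for all tables ...
    ok = mnesia:set_master_nodes(['a@host1']),
    %% ... or pick masters per table.
    ok = mnesia:set_master_nodes(my_tab, ['a@host1', 'b@host2']).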
  64. node_has_tabs([Tab | Tabs], Node, State) when Node /= node() -> State2 = case catch update_whereabouts(Tab, Node, State) of State1 = #state{} -> State1; {EXIT, R} -> %% Tab was just deleted? case ?catch_val({Tab, cstruct}) of {EXIT, _} -> State; % yes _ -> erlang:error(R) end end, node_has_tabs(Tabs, Node, State2);update_whereabouts(Tab, Node, State) -> Storage = val({Tab, storage_type}), Read = val({Tab, where_to_read}), LocalC = val({Tab, local_content}), BeingCreated = (?catch_val({Tab, create_table}) == true), Masters = mnesia_recover:get_master_nodes(Tab), ByForce = val({Tab, load_by_force}), GoGetIt = if ByForce == true -> true; Masters == [] -> true; true -> lists:member(Node, Masters) end, if … %%启动时,有多个副本的表,其 where_to_read 首先设置为 nowhere,此处触发表的远程节点加载过程: Read == nowhere -> add_active_replica(Tab, Node), case GoGetIt of true -> %%产生一个#net_load{}任务,通过 opt_start_loader->load_and_reply 启动一个load_table_fun 来处理这个任务。 Worker = #net_load{table = Tab, reason = {active_remote, Node}}, add_worker(Worker, State); false -> State
  65. end; … end.load_table_fun(#net_load{cstruct=Cs, table=Tab, reason=Reason, opt_reply_to=ReplyTo}) -> LocalC = val({Tab, local_content}), AccessMode = val({Tab, access_mode}), ReadNode = val({Tab, where_to_read}), Active = filter_active(Tab), Done = #loader_done{is_loaded = true, table_name = Tab, needs_announce = false, needs_sync = false, needs_reply = (ReplyTo /= undefined), reply_to = ReplyTo, reply = {loaded, ok} }, if ReadNode == node() -> %% Already loaded locally fun() -> Done end; LocalC == true -> fun() -> Res = mnesia_loader:disc_load_table(Tab, load_local_content), Done#loader_done{reply = Res, needs_announce = true, needs_sync = true} end; AccessMode == read_only, Reason /= {dumper,add_table_copy} -> fun() -> disc_load_table(Tab, Reason, ReplyTo) end; true -> fun() -> %% Either we cannot read the table yet %% or someone is moving a replica between %% two nodes %%加载过程为创建表对应的 ets 表,然后从远程节点的发送进程逐条读取记录到本节点的接收进程,有本地接收进程将记录重新插入 ets 表。 Res = mnesia_loader:net_load_table(Tab, Reason, Active, Cs), case Res of {loaded, ok} -> Done#loader_done{needs_sync = true, reply = Res}; {not_loaded, _} -> Done#loader_done{is_loaded = false,
                                             reply = Res}
                    end
            end
    end;

5. Data merging between nodes, part 2: loading tables from local disc

Because try_merge_schema first fetches data from the running nodes, the local node prefers to stay consistent with the remote copies: mnesia favours the replicas held by running nodes, since they are likely to contain the most recent content. Only when there is no running remote node (for example, the whole cluster was shut down and this node is the first one to start) does it consider loading data from the local disc.

The main scenarios are:
1. If the node is not the first one to start in a stopped cluster: if it is a master node it loads all tables from disc, otherwise it may only load local_content tables from disc.
2. If the node is the first one to start in a stopped cluster:
   a) If this node was the last one to shut down (determined from the mnesia_down history in the decision table), it loads its tables from local disc;
   b) If this node was not the last one to shut down, it does not load tables from local disc; it waits until another remote node starts and sends it an adopt_orphans message, and only then loads the tables from that remote node. Until a remote node answers, the tables are not visible on this node (the table definitions can be obtained from mnesia:schema/0, but the ets tables backing them have not been created yet, so the tables cannot be accessed).

schema_is_merged() ->
    MsgTag = schema_is_merged,
    %% From the mnesia_down history, determine whether this node was the last one in
    %% the cluster to shut down, or is a master node of the table; if so, the table
    %% will be loaded from local disc:
  67. SafeLoads = initial_safe_loads(), try_schedule_late_disc_load(SafeLoads, initial, MsgTag).initial_safe_loads() -> case val({schema, storage_type}) of ram_copies -> Downs = [], Tabs = val({schema, local_tables}) -- [schema], LastC = fun(T) -> last_consistent_replica(T, Downs) end, lists:zf(LastC, Tabs); disc_copies -> Downs = mnesia_recover:get_mnesia_downs(), dbg_out("mnesia_downs = ~p~n", [Downs]), Tabs = val({schema, local_tables}) -- [schema], LastC = fun(T) -> last_consistent_replica(T, Downs) end, lists:zf(LastC, Tabs) end.last_consistent_replica(Tab, Downs) -> Cs = val({Tab, cstruct}), Storage = mnesia_lib:cs_to_storage_type(node(), Cs), Ram = Cs#cstruct.ram_copies, Disc = Cs#cstruct.disc_copies, DiscOnly = Cs#cstruct.disc_only_copies, BetterCopies0 = mnesia_lib:remote_copy_holders(Cs) -- Downs, BetterCopies = BetterCopies0 -- Ram, AccessMode = Cs#cstruct.access_mode, Copies = mnesia_lib:copy_holders(Cs), Masters = mnesia_recover:get_master_nodes(Tab), LocalMaster0 = lists:member(node(), Masters), LocalContent = Cs#cstruct.local_content, RemoteMaster = if Masters == [] -> false; true -> not LocalMaster0 end, LocalMaster = if Masters == [] -> false; true -> LocalMaster0 end, if Copies == [node()] ->
  68. %% Only one copy holder and it is local. %% It may also be a local contents table {true, {Tab, local_only}}; LocalContent == true -> {true, {Tab, local_content}}; LocalMaster == true -> %% We have a local master {true, {Tab, local_master}}; RemoteMaster == true -> %% Wait for remote master copy false; Storage == ram_copies -> if Disc == [], DiscOnly == [] -> %% Nobody has copy on disc {true, {Tab, ram_only}}; true -> %% Some other node has copy on disc false end; AccessMode == read_only -> %% No one has been able to update the table, %% i.e. all disc resident copies are equal {true, {Tab, read_only}}; BetterCopies /= [], Masters /= [node()] -> %% There are better copies on other nodes %% and we do not have the only master copy false; true -> {true, {Tab, initial}} end.try_schedule_late_disc_load(Tabs, _Reason, MsgTag) when Tabs == [], MsgTag /= schema_is_merged -> ignore;try_schedule_late_disc_load(Tabs, Reason, MsgTag) -> %%通过一个 mnesia 事务来进行表加载过程 GetIntents = fun() -> %%上一个全局磁盘表加载锁 mnesia_late_disc_load Item = mnesia_late_disc_load, Nodes = val({current, db_nodes}),
  69. mnesia:lock({global, Item, Nodes}, write), %%询问其它远程节点,它们是否正在加载或已经加载了这些表,若正在加载或已经加载,则本节点不会从磁盘加载,而是等待远程节点产生 adopt_orphans 消息,告知本节点远程加载表。 case multicall(Nodes -- [node()], disc_load_intents) of {Replies, []} -> %%等待表加载完成: %%MsgTag = schema_is_merged call({MsgTag, Tabs, Reason, Replies}), done; {_, BadNodes} -> %% Some nodes did not respond, lets try again {retry, BadNodes} end end, case mnesia:transaction(GetIntents) of {atomic, done} -> done; {atomic, {retry, BadNodes}} -> verbose("Retry late_load_tables because bad nodes: ~p~n", [BadNodes]), try_schedule_late_disc_load(Tabs, Reason, MsgTag); {aborted, AbortReason} -> fatal("Cannot late_load_tables~p: ~p~n", [[Tabs, Reason, MsgTag], AbortReason]) end.handle_call({schema_is_merged, TabsR, Reason, RemoteLoaders}, From, State) -> %% 产 生 一 个 #disc_load{} 任 务 , 通 过 opt_start_loader->load_and_reply 启 动 一 个load_table_fun 来处理这个任务。 State2 = late_disc_load(TabsR, Reason, RemoteLoaders, From, State), Msgs = State2#state.early_msgs, State3 = State2#state{early_msgs = [], schema_is_merged = true}, handle_early_msgs(lists:reverse(Msgs), State3);load_table_fun(#disc_load{table=Tab, reason=Reason, opt_reply_to=ReplyTo}) -> ReadNode = val({Tab, where_to_read}), Active = filter_active(Tab),
  70. Done = #loader_done{is_loaded = true, table_name = Tab, needs_announce = false, needs_sync = false, needs_reply = false }, if Active == [], ReadNode == nowhere -> %% Not loaded anywhere, lets load it from disc fun() -> disc_load_table(Tab, Reason, ReplyTo) end; ReadNode == nowhere -> %% Already loaded on other node, lets get it Cs = val({Tab, cstruct}), fun() -> case mnesia_loader:net_load_table(Tab, Reason, Active, Cs) of {loaded, ok} -> Done#loader_done{needs_sync = true}; {not_loaded, storage_unknown} -> Done#loader_done{is_loaded = false}; {not_loaded, ErrReason} -> Done#loader_done{is_loaded = false, reply = {not_loaded,ErrReason}} end end; true -> %% Already readable, do not worry be happy fun() -> Done end end.disc_load_table(Tab, Reason, ReplyTo) -> Done = #loader_done{is_loaded = true, table_name = Tab, needs_announce = false, needs_sync = false, needs_reply = ReplyTo /= undefined, reply_to = ReplyTo, reply = {loaded, ok} }, %%加载过程为从表的磁盘数据文件 Table.DCT 中取得数据(erlang 的 term),并从日志文件 Table.DCL(若有)中取得 redo 日志,合并到数据的 ets 表中。 Res = mnesia_loader:disc_load_table(Tab, Reason), if Res == {loaded, ok} ->
            Done#loader_done{needs_announce = true, needs_sync = true,
                             reply = Res};
       ReplyTo /= undefined ->
            Done#loader_done{is_loaded = false,
                             reply = Res};
       true ->
            fatal("Cannot load table ~p from disc: ~p~n", [Tab, Res])
    end.

In the end the table is loaded from disc via mnesia_loader:disc_load_table. So regardless of whether a table comes from the network or from disc, the load always goes through load_table_fun -> mnesia_loader; only the loading function differs: net_load_table in one case, disc_load_table in the other. The two functions work in essentially the same way: create the ets table backing the table, then fetch the records from the remote node or from disc and insert them into that ets table.

6. Data merging between nodes, part 3: table loading completed

When a table has been loaded, load_table_fun returns a #loader_done{} to mnesia_controller:

handle_info(Done = #loader_done{worker_pid=WPid, table_name=Tab}, State0) ->
    LateQueue0 = State0#state.late_loader_queue,
    State1 = State0#state{loader_pid = lists:keydelete(WPid,1,get_loaders(State0))},
    State2 =
        case Done#loader_done.is_loaded of
            true ->
                %% Optional table announcement
                if
                    Done#loader_done.needs_announce == true,
                    Done#loader_done.needs_reply == true ->
                        i_have_tab(Tab),
                        %% Should be {dumper,add_table_copy} only
                        reply(Done#loader_done.reply_to,
                              Done#loader_done.reply);
                    Done#loader_done.needs_reply == true ->
                        %% Should be {dumper,add_table_copy} only
  72. reply(Done#loader_done.reply_to, Done#loader_done.reply); Done#loader_done.needs_announce == true, Tab == schema -> i_have_tab(Tab); Done#loader_done.needs_announce == true -> i_have_tab(Tab), %% Local node needs to perform user_sync_tab/1 Ns = val({current, db_nodes}), abcast(Ns, {i_have_tab, Tab, node()}); Tab == schema -> ignore; true -> %% Local node needs to perform user_sync_tab/1 Ns = val({current, db_nodes}), AlreadyKnows = val({Tab, active_replicas}), %%表加载完成后,本节点会向其它节点发送一个 i_have_tab 消息,通知其他节点本节点持有最完整的表副本,其它节点进一步通过 node_has_tabs->update_whereabouts向本节点取得表数据,加入自己的副本中。 abcast(Ns -- AlreadyKnows, {i_have_tab, Tab, node()}) end, %% Optional user sync case Done#loader_done.needs_sync of true -> user_sync_tab(Tab); false -> ignore end, State1#state{late_loader_queue=gb_trees:delete_any(Tab, LateQueue0)}; false -> %% Either the node went down or table was not %% loaded remotly yet case Done#loader_done.needs_reply of true -> reply(Done#loader_done.reply_to, Done#loader_done.reply); false -> ignore end, case ?catch_val({Tab, active_replicas}) of [_|_] -> % still available elsewhere {value,{_,Worker}} = lists:keysearch(WPid,1,get_loaders(State0)), add_loader(Tab,Worker,State1); _ ->
                    State1
            end
        end,
    State3 = opt_start_worker(State2),
    noreply(State3);

Table replica synchronization therefore uses both a push and a pull path:
1. At startup a node actively pulls data from the running nodes. The path is: this node -im_running-> remote node -adopt_orphans-> this node; this node then creates a #net_load{} job via node_has_tabs and pulls the data from the remote node.
2. After a starting node has loaded a table, it actively pushes the table to the other running nodes. The path is: this node finishes merging the schema, decides to load the table from local disc and, once the load is done, sends i_have_tab to the remote nodes; each remote node creates a #net_load{} job via node_has_tabs and pulls the data from this node.

7. Partition detection

Synchronous detection refers to the partition checks mnesia performs when it takes distributed locks or commits a transaction. Since mnesia actively contacts every participating node, this check is integrated directly into the lock and transaction protocols. For tables with the majority option, the lock and commit interactions let the transaction proceed only if more than half of the replica nodes are alive and able to participate; otherwise it must abort.

1. Synchronous detection in the lock protocol

The lock protocol is a one-phase synchronous protocol.

mnesia_locker.erl
  74. wlock(Tid, Store, Oid) -> wlock(Tid, Store, Oid, _CheckMajority = true).wlock(Tid, Store, Oid, CheckMajority) -> {Tab, Key} = Oid, case need_lock(Store, Tab, Key, write) of yes -> {Ns, Majority} = w_nodes(Tab), if CheckMajority -> check_majority(Majority, Tab, Ns); true -> ignore end, Op = {self(), {write, Tid, Oid}}, ?ets_insert(Store, {{locks, Tab, Key}, write}), get_wlocks_on_nodes(Ns, Ns, Store, Op, Oid); no when Key /= ?ALL, Tab /= ?GLOBAL -> []; no -> element(2, w_nodes(Tab)) end.w_nodes(Tab) -> case ?catch_val({Tab, where_to_wlock}) of {[_ | _], _} = Where -> Where; _ -> mnesia:abort({no_exists, Tab}) end.check_majority(true, Tab, HaveNs) -> check_majority(Tab, HaveNs);check_majority(false, _, _) -> ok.check_majority(Tab, HaveNs) -> case ?catch_val({Tab, majority}) of true -> case mnesia_lib:have_majority(Tab, HaveNs) of true -> ok; false -> mnesia:abort({no_majority, Tab}) end; _ -> ok
    end.

When a write lock is taken, the table's where_to_wlock property thus determines whether a majority check is required. where_to_wlock is a dynamic property: it is updated whenever nodes join or leave. The check itself is the same as the majority check for schema table operations: it only succeeds when more than half of the replicas agree.

2. Synchronous detection in the transaction protocol

The transaction protocol has two synchronous phases followed by one asynchronous phase. Since its commit procedure has already been described earlier, only the majority check is shown here:

multi_commit(asym_trans, Majority, Tid, CR, Store) ->
    D = #decision{tid = Tid, outcome = presume_abort},
    {D2, CR2} = commit_decision(D, CR, [], []),
    DiscNs = D2#decision.disc_nodes,
    RamNs = D2#decision.ram_nodes,
    case have_majority(Majority, DiscNs ++ RamNs) of
        ok -> ok;
        {error, Tab} -> mnesia:abort({no_majority, Tab})
    end,
    …

When a transaction is committed, the table's where_to_commit property determines whether a majority check is required. where_to_commit is likewise a dynamic property that changes as nodes join or leave. Both the lock phase and the commit phase treat the majority check as a precondition invariant, which makes it cheap to decide early whether the transaction can proceed at all.

An invariant analysis of the lock and transaction protocols:
1. The participants of the lock protocol are determined by the table's where_to_wlock property;
2. If the lock protocol's majority check fails, the record cannot be locked and the transaction aborts;
  76. 3. 若锁协议 majority 检查通过,此时锁协议的参与节点已经确定,若任何一个参与节点退 出,则该退出能在锁协议的同步交互阶段被检测出来,从而导致上锁失败,事务退出;4. 若锁协议 majority 检查通过,请求节点在同步锁请求过程中退出,对于已上锁的参与节 点,锁会因超时而被清除,对于未上锁的参与节点,没有任何影响;5. 若锁协议 majority 检查通过,所有参与节点同意上锁,之后的过程将由事务提交过程接 管,若此后某个参与节点退出,也不会影响到事务协议;6. 确定事务协议的参与节点为表的 where_to_commit 属性,事务协议的参与节点的确立晚 于锁协议参与节点,且与其无关,因此任何在阶段 5 之后有任何节点退出退出,均不会 影响事务协议,事务协议单独决策;7. 若事务协议 majority 检查不通过,则记录无法提交,事务退出;8. 若事务协议 majority 检查通过,此时事务协议的参与节点已经确定,若任何一个参与节 点在 prepare、precommit 阶段退出,则该退出能在事务协议的同步交互阶段被检测出来, 从而导致提交失败,事务退出;9. 若事务协议 majority 检查通过, 请求节点在 prepare、 precommit 阶段退出,对于已 prepare、 precommit 的参与节点,因没有进行实际的提交,不会有任何实际状态的改变,事务描 述符会因超时而被清除,对于未 prepare、precommit 的参与节点,没有任何影响;10. 若事务协议 majority 检查通过,此时事务协议的参与节点已经确定,若任何一个参与节 点在 commit 阶段退出,此时其它参与节点将进行提交,退出节点无法提交,但是能在 恢复时,从已提交节点处获取提交数据,进行自身的提交;11. 若事务协议 majority 检查通过,此时事务协议的参与节点已经确定,若请求节点在 commit 阶段退出,mnesia_monitor 检查到请求节点退出, mnesia_tm 通告 mnesia_down 向 消息,mnesia_tm 在收到了 mnesia_down 消息后,会通过 mnesia_recover 询问其它节点
  77. 是否有提交: a) 若 commit 还未进行,各个参与节点彼此询问后,得到的结果仍然是未决的,则一 致认为事务结果为 abort,并向其它节点广播自身的 abort 结果,事务退出,不会 出现不一致状态; b) 若 commit 已经进行了一部分,此时集群中仅存在二类参与节点:已提交的,未决 的。未决的参与节点在询问到已提交的参与节点时,已提交的节点会返回 commit 的结果,未决节点也因此可以 commit。 c) 若 commit 完成,各个参与节点均已提交,不会出现不一致状态;确定事务结果为退出,并向其它节点广播自身的 abort 结果,而已提交的参与节点确定事务结果为提交,此时出现不一致,已提交节点产生{inconsistent_database, bad_decision, Node}消息,需要某种决策解决这个问题;但由于在第三阶段提交时,请求者已经几乎没有什么操作,仅仅是异步广播一条 {Tid,committed}消息,因此,此处出现不一致的情况微乎其微;inconsistent_database 消 息 还 会 出 现 在 运 行 时 : {inconsistent_database,running_partitioned_network, Node}和重新启动时:{inconsistent_database, starting_partitioned_network, Node}。由于这些策略的存在,上述不变式可以处理多个节点退出的情况。由于前两个阶段未提交,因此不会出现不一致的状态,而在第三阶段中:1. 仅请求节点退出,各个参与节点检测到请求节点的退出,mnesia_tm 开始询问其它节点 的事务结果,从全局角度来看: a) 若没有参与节点提交,则所有参与节点都认为该事务 abort; b) 若一部分参与节点提交,则未决参与节点询问提交参与节点后,得到提交结果,最
  78. 终事务提交; c) 请求节点在重启时,将根据本地恢复日志确定提交的结果: i. 若没有第一阶段恢复日志 unclear,则事务被认为是中止; ii. 若仅有第一阶段恢复日志 unclear,需等待其他参与节点的结果: 1. 若其他参与节点提交,则根据其他参与节点的提交结果进行提交; 2. 若其他参与节点中止,则根据其他参与节点的中止结果进行中止; 3. 若所有参与节点均不知道结果,也即所有参与节点均仅有第一阶段恢复日 志 unclear,此时认为事务中止(代码注释中写明提交,但经过试验却发 现并非如此,因为 unclear 日志不会落盘) ; iii. 若已有第二阶段恢复日志 committed,则事务被认为是提交;2. 请求节点不退出,仅部分/全部参与节点退出,请求节点与各个存活参与节点检测到请 求节点的退出,mnesia_tm 开始询问其它节点的事务结果,从全局角度来看: a) 第三阶段时,请求节点发出 commit 消息前后,会立即进行提交,其它未决参与节 点至少可以根据请求节点的提交结果,决定提交事务; b) 退出的参与节点在重启时,将根据本地恢复日志和请求节点确定提交的结果,过程 如{1,c}; c) 全部参与节点退出是 b 的特例,各个退出的参与节点在重启时,将根据本地恢复日 志和请求节点确定提交的结果,过程如{1,c};3. 请求节点与参与节点都有退出: a) 若请求节点在发出 commit 消息之前退出,提交还未开始,参与节点无论是否有退 出实际情况均为{1,a},各个节点重启时如{1,c}; b) 若请求节点在发出 commit 消息之后,没有开始提交或部分提交,然后退出,且参
  79. 与节点保持如下三种状态: i. 存活参与节点未能收到 commit,并存活; ii. 存活参与节点未能收到 commit,并退出; iii. 收到 commit 消息的参与节点未开始提交时便退出; 则实际情况为{1,a},各个节点重启时如{1,c,i}; c) 若请求节点在发出 commit 消息之后,且提交完成,然后退出,其它参与节点保持 如下三种状态: i. 存活参与节点未能收到 commit,并存活; ii. 存活参与节点未能收到 commit,并退出; iii. 收到 commit 消息的参与节点未开始提交时便退出; 则实际情况为{2,a},各个节点重启时如{1,c,ii,1}; d) 若请求节点在发出 commit 消息之后,没有开始提交或部分提交,然后退出,且有 至少一个参与节点提交完成,则实际情况为{1,b},各个节点重启时如{1,c,ii, 1};4. 一个特殊的情况,场景极难构造,可能造成不一致,但若出现这种情况,则只能是 mnesia 自身的 bug: 这个场景难于构造在 d 步骤,参与节点的其它 mnesia 模块通常也会立即收到 mnesia_tm 引发的{EXIT, Pid, Reason}消息,并且立即退出。 a) 请求节点在发出 commit 消息之后,且提交完成; b) 参与节点的 commit_participant 进程在收到 commit 消息前; c) 参与节点的 mnesia_tm 先一步退出, commit_participant 进程会收到{EXIT, Pid, Reason}消息;
  80. d) 参与节点的其它 mnesia 模块还未收到 mnesia_tm 引发的{EXIT, Pid, Reason}消息, 仍在正常工作; e) 参与节点的 commit_participant 进程将决定中止事务,此时出现不一致; f) 再次访问 mnesia 时, mnesia 将抛出{inconsistent_database, bad_decision, Node}消息; g) 重启 mnesia 时协商恢复。3. 节点 down 异步检测1. 检测原理在 mnesia 没有处理任何事务的情况下,若此时 erlang 虚拟机检测到任何节点退出,mnesia需要进行网络分区检查,但是这个检查的流程有些特殊:mnesia_monitor.erlhandle_call(init, _From, State) -> net_kernel:monitor_nodes(true), EarlyNodes = State#state.early_connects, State2 = State#state{tm_started = true}, {reply, EarlyNodes, State2};启动时,mnesia_monitor 将监听节点 up/down 的消息。handle_info({nodedown, _Node}, State) -> %% Ignore, we are only caring about nodeups {noreply, State};mnesia 并不依赖于 nodedown 消息处理节点退出,而是在新节点加入时,通过 link 新节点上的 mnesia_monitor 进程进行节点退出检查:handle_info(Msg = {EXIT,Pid,_}, State) -> Node = node(Pid), if Node /= node(), State#state.connecting == undefined -> %% Remotly linked process died, assume that it was a mnesia_monitor mnesia_recover:mnesia_down(Node),
  81. mnesia_controller:mnesia_down(Node), {noreply, State#state{going_down = [Node | State#state.going_down]}}; Node /= node() -> {noreply, State#state{mq = State#state.mq ++ [{info, Msg}]}}; true -> %% We have probably got an exit signal from %% disk_log or dets Hint = "Hint: check that the disk still is writable", fatal("~p got unexpected info: ~p; ~p~n", [?MODULE, Msg, Hint]) end;handle_cast({mnesia_down, mnesia_controller, Node}, State) -> mnesia_tm:mnesia_down(Node), {noreply, State};handle_cast({mnesia_down, mnesia_tm, {Node, Pending}}, State) -> mnesia_locker:mnesia_down(Node, Pending), {noreply, State};handle_cast({mnesia_down, mnesia_locker, Node}, State) -> Down = {mnesia_down, Node}, mnesia_lib:report_system_event(Down), GoingDown = lists:delete(Node, State#state.going_down), State2 = State#state{going_down = GoingDown}, Pending = State#state.pending_negotiators, case lists:keysearch(Node, 1, Pending) of {value, {Node, Mon, ReplyTo, Reply}} -> %% Late reply to remote monitor link(Mon), %% link to remote Monitor gen_server:reply(ReplyTo, Reply), P2 = lists:keydelete(Node, 1,Pending), State3 = State2#state{pending_negotiators = P2}, process_q(State3); false -> %% No pending remote monitors process_q(State2) end;当发现有某个节点的 mnesia_monitor 进程退出时,这时候要依次通知 mnesia_recover、mnesia_controller、mnesia_tm、mnesia_locker、监听 mnesia_down 消息的进程,告知某个节点的 mnesia 退出,最后进行 mnesia_monitor 本身对 mnesia_down 消息的处理。这些处理阶段主要包括:1. mnesia_recover:运行时无动作,仅在初始化阶段对未决事务进行检查;
  82. 2. mnesia_controller: a) 通知 mnesia_recover 记录 mnesia_down 历史事件,以便将来退出节点重新连接时 进行分区检查; b) 修 改 所 有 与 退 出 节 点 相 关 的 表 的 几 项 动 态 节 点 属 性 ( where_to_commit 、 where_to_write、where_to_wlock、active_replicas) 以保持表的全局拓扑的正确性; ,3. mnesia_tm:重新配置退出节点参与的事务,若退出节点参与了本节点的 coordinator 组 织的事务,则进行 coordinator 校正,通知 coordinator 一个 mnesia_down 消息,而 coordinator 根据提交阶段进行修正,选择中止事务或提交事务;若退出节点参与了本节 点的 participant 参与的事务,则进行 participant 校正,告知其它节点该退出节点的状态;4. mnesia_locker : 清 除 四 张 锁 表 中 ( mnesia_lock_queue 、 mnesia_held_locks , mnesia_sticky_locks,mnesia_tid_locks),与退出节点相关的锁;5. 上层应用:mnesia 向订阅了 mnesia_down 消息的应用投递该消息;2. mnesia_recover 处理 mnesia_down 消息mnesia_recover.erlmnesia_down(Node) -> case ?catch_val(recover_nodes) of {EXIT, _} -> %% Not started yet ignore; _ -> mnesia_lib:del(recover_nodes, Node), cast({mnesia_down, Node}) end.handle_cast({mnesia_down, Node}, State) -> case State#state.unclear_decision of undefined -> {noreply, State};
  83. D -> case lists:member(Node, D#decision.ram_nodes) of false -> {noreply, State}; true -> State2 = add_remote_decision(Node, D, State), {noreply, State2} end end;unclear_decision 仅用于启动时对之前的未决事务进行恢复,运行时期总为 undefined。3. mnesia_controller 处理 mnesia_down 消息mnesia_controller.erlmnesia_down(Node) -> case cast({mnesia_down, Node}) of {error, _} -> mnesia_monitor:mnesia_down(?SERVER_NAME, Node); _Pid -> ok end.handle_cast({mnesia_down, Node}, State) -> maybe_log_mnesia_down(Node), …maybe_log_mnesia_down(N) -> case mnesia_lib:is_running() of yes -> verbose("Logging mnesia_down ~w~n", [N]), mnesia_recover:log_mnesia_down(N), ok; … end.mnesia_controller 对 mnesia_down 的 处 理 过 程 中 , 会 通 知 mnesia_recover 记 录 这 个mnesia_down 事件,以备在该节点启动时进行额外的网络分区检查工作:mnesia_recover.erllog_mnesia_down(Node) -> call({log_mnesia_down, Node}).handle_call({log_mnesia_down, Node}, _From, State) -> do_log_mnesia_down(Node),
  84. {reply, ok, State};do_log_mnesia_down(Node) -> Yoyo = {mnesia_down, Node, Date = date(), Time = time()}, case mnesia_monitor:use_dir() of true -> mnesia_log:append(latest_log, Yoyo), disk_log:sync(latest_log); false -> ignore end, note_down(Node, Date, Time).note_down(Node, Date, Time) -> ?ets_insert(mnesia_decision, {mnesia_down, Node, Date, Time}).mnesia_recover 除了记录 mnesia_down 日志外,还会在 ets 表 mnesia_decision 中记录节点的down 历史记录,以备在该节点重新 up 时做检查,这样可以避免由于瞬断 ABA 问题导致的不一致。若不记录历史记录,则在如下情况发生时,将造成不一致:1. 集群由 A、B、C 组成2. A 由于网络问题,出现瞬断,B、C 完好3. B、C 写入数据,满足 majority 条件4. A 网络恢复,并未记录 B、C 出现 mnesia_down 的情况,仍然认为自己在集群中,此时 数据不一致mnesia_controller.erlhandle_cast({mnesia_down, Node}, State) -> … mnesia_lib:del({current, db_nodes}, Node), mnesia_lib:unset({node_up, Node}), mnesia_checkpoint:tm_mnesia_down(Node), Alltabs = val({schema, tables}), reconfigure_tables(Node, Alltabs), …reconfigure_tables(N, [Tab |Tail]) -> del_active_replica(Tab, N),
    case val({Tab, where_to_read}) of
        N ->
            mnesia_lib:set_remote_where_to_read(Tab);
        _ ->
            ignore
    end,
    reconfigure_tables(N, Tail);
reconfigure_tables(_, []) ->
    ok.

del_active_replica(Tab, Node) ->
    Var = {Tab, where_to_commit},
    {Blocked, Old} = is_tab_blocked(val(Var)),
    Del = lists:keydelete(Node, 1, Old),
    New = lists:sort(Del),
    set(Var, mark_blocked_tab(Blocked, New)), % where_to_commit
    mnesia_lib:del({Tab, active_replicas}, Node),
    mnesia_lib:del({Tab, where_to_write}, Node),
    update_where_to_wlock(Tab).

update_where_to_wlock(Tab) ->
    WNodes = val({Tab, where_to_write}),
    Majority = case catch val({Tab, majority}) of
                   true -> true;
                   _ -> false
               end,
    set({Tab, where_to_wlock}, {WNodes, Majority}).

Besides logging the mnesia_down history, mnesia_controller also has to:
1. update the global db_nodes;
2. update the per-table node information, namely where_to_commit, where_to_write, where_to_wlock and active_replicas.

Since active_replicas is updated immediately when a mnesia_down occurs, this also enables the following strategy: monitor the active_replicas of the schema table, and if it no longer matches the configured set of disc nodes while the missing disc nodes still show up in erlang:nodes/0, ping those missing nodes. If they can be pinged, the cluster must have been partitioned at some point and the process watching for inconsistent_database messages missed one of them; at that point mnesia can be restarted and renegotiated. A sketch of this check follows below.
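A minimal sketch of that watchdog, assuming the application knows its intended set of disc nodes (ExpectedDiscNodes) and treating the return value as a mere signal; note that active_replicas is an internal table_info item, so this relies on behaviour that is stable in practice but not formally documented:

check_partition(ExpectedDiscNodes) ->
    %% Compare the schema's active replicas with the intended disc
    %% nodes; ping any disc node that is missing from active_replicas
    %% but still visible to the Erlang distribution.
    Active   = mnesia:table_info(schema, active_replicas),
    Missing  = ExpectedDiscNodes -- Active,
    Suspects = [N || N <- Missing,
                     lists:member(N, nodes()),
                     net_adm:ping(N) =:= pong],
    case Suspects of
        [] -> ok;
        _  -> {suspect_partition, Suspects}   % e.g. restart mnesia and re-merge
    end.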
  86. 4. mnesia_tm 处理 mnesia_down 消息mnesia_down(Node) -> case whereis(?MODULE) of undefined -> mnesia_monitor:mnesia_down(?MODULE, {Node, []}); Pid -> Pid ! {mnesia_down, Node} end.doit_loop(#state{coordinators=Coordinators,participants=Participants,supervisor=Sup}=State) -> receive … {mnesia_down, N} -> verbose("Got mnesia_down from ~p, reconfiguring...~n", [N]), reconfigure_coordinators(N, gb_trees:to_list(Coordinators)), Tids = gb_trees:keys(Participants), reconfigure_participants(N, gb_trees:values(Participants)), NewState = clear_fixtable(N, State), mnesia_monitor:mnesia_down(?MODULE, {N, Tids}), doit_loop(NewState); … end.mnesia_tm 需要重新配置 coordinator 与 participant,coordinator 用于请求者,participant 用于充当三阶段提交的参与者。mnesia_tm 重设 coordinator:reconfigure_coordinators(N, [{Tid, [Store | _]} | Coordinators]) -> case mnesia_recover:outcome(Tid, unknown) of committed -> WaitingNodes = ?ets_lookup(Store, waiting_for_commit_ack), case lists:keymember(N, 2, WaitingNodes) of false -> ignore; % avoid spurious mnesia_down messages true -> send_mnesia_down(Tid, Store, N) end;
  87. aborted -> ignore; % avoid spurious mnesia_down messages _ -> %% Tell the coordinator about the mnesia_down send_mnesia_down(Tid, Store, N) end, reconfigure_coordinators(N, Coordinators);reconfigure_coordinators(_N, []) -> ok.send_mnesia_down(Tid, Store, Node) -> Msg = {mnesia_down, Node}, send_to_pids([Tid#tid.pid | get_elements(friends,Store)], Msg).reconfigure_coordinators 主要服务于请求者, mnesia_recover 保存了最近的事务的活动状态,对于未明结果的事务,需要向其请求者发送 mnesia_down 消息, 告知其某个参与者节点 down。若此时没有活动事务,节点也未参与任何活动事务,此时 mnesia_tm 不需要重设 coordinator与 participant。请求者处第一阶段提交对 mnesia_down 的处理:rec_all([Node | Tail], Tid, Res, Pids) -> receive … {mnesia_down, Node} -> %% Make sure that mnesia_tm knows it has died %% it may have been restarted Abort = {do_abort, {bad_commit, Node}}, catch {?MODULE, Node} ! {Tid, Abort}, rec_all(Tail, Tid, Abort, Pids) end;请求者处第二阶段提交对 mnesia_down 的处理:rec_acc_pre_commit([Pid | Tail], Tid, Store, Commit, Res, DumperMode, GoodPids, SchemaAckPids) -> receive … {mnesia_down, Node} when Node == node(Pid) -> AbortRes = {do_abort, {bad_commit, Node}}, catch Pid ! {Tid, AbortRes}, %% Tell him that he has died rec_acc_pre_commit(Tail, Tid, Store, Commit, AbortRes, DumperMode, GoodPids, SchemaAckPids)
  88. end;请求者处第三阶段提交对 mnesia_down 的处理:对于普通操作,第三阶段是异步的,这时已经不需要监控 mnesia_down 消息了。对于 schema 操作,第三阶段是同步的:sync_schema_commit(_Tid, _Store, []) -> ok;sync_schema_commit(Tid, Store, [Pid | Tail]) -> receive {?MODULE, _, {schema_commit, Tid, Pid}} -> ?ets_match_delete(Store, {waiting_for_commit_ack, node(Pid)}), sync_schema_commit(Tid, Store, Tail); {mnesia_down, Node} when Node == node(Pid) -> ?ets_match_delete(Store, {waiting_for_commit_ack, Node}), sync_schema_commit(Tid, Store, Tail) end.此处仍然参考三阶段提交,执行到这个位置时,请求者的 schema 已经更新完毕,而其它参与者即使 down 了,也可以在重启后恢复。mnesia_tm 重设 participant:reconfigure_participants(N, [P | Tail]) -> case lists:member(N, P#participant.disc_nodes) or lists:member(N, P#participant.ram_nodes) of false -> reconfigure_participants(N, Tail); true -> Tid = P#participant.tid, if node(Tid#tid.pid) /= N -> reconfigure_participants(N, Tail); true -> verbose("Coordinator ~p in transaction ~p died~n", [Tid#tid.pid, Tid]), Nodes = P#participant.disc_nodes ++ P#participant.ram_nodes, AliveNodes = Nodes -- [N], Protocol = P#participant.protocol, tell_outcome(Tid, Protocol, N, AliveNodes, AliveNodes), reconfigure_participants(N, Tail) end end;reconfigure_participants(_, []) -> [].
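The outcome resolution below (tell_outcome / what_happened) is also what can end in an {inconsistent_database, bad_decision, Node} report when a committed and an aborted participant meet, as discussed later in this subsection. Applications that want to react to such reports can subscribe to mnesia's system events; a minimal sketch using the documented mnesia:subscribe/1 API (the reaction here is only a placeholder):

watch_system_events() ->
    %% Subscribe to mnesia system events and react to the
    %% partition-related ones; printing stands in for real repair logic.
    {ok, _Node} = mnesia:subscribe(system),
    event_loop().

event_loop() ->
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            io:format("partition suspected (~p) against ~p~n", [Context, Node]),
            event_loop();
        {mnesia_system_event, {mnesia_down, Node}} ->
            io:format("mnesia stopped on ~p~n", [Node]),
            event_loop();
        {mnesia_system_event, _Other} ->
            event_loop()
    end.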
  89. tell_outcome(Tid, Protocol, Node, CheckNodes, TellNodes) -> Outcome = mnesia_recover:what_happened(Tid, Protocol, CheckNodes), case Outcome of aborted -> rpc:abcast(TellNodes, ?MODULE, {Tid,{do_abort, {mnesia_down, Node}}}); committed -> rpc:abcast(TellNodes, ?MODULE, {Tid, do_commit}) end, Outcome.mnesia_recover.erlwhat_happened(Tid, Protocol, Nodes) -> Default = case Protocol of asym_trans -> aborted; _ -> unclear %% sym_trans and sync_sym_trans end, This = node(), case lists:member(This, Nodes) of true -> {ok, Outcome} = call({what_happened, Default, Tid}), Others = Nodes -- [This], case filter_outcome(Outcome) of unclear -> what_happened_remotely(Tid, Default, Others); aborted -> aborted; committed -> committed end; false -> what_happened_remotely(Tid, Default, Nodes) end.handle_call({what_happened, Default, Tid}, _From, State) -> sync_trans_tid_serial(Tid), Outcome = outcome(Tid, Default), {reply, {ok, Outcome}, State};what_happened_remotely(Tid, Default, Nodes) -> {Replies, _} = multicall(Nodes, {what_happened, Default, Tid}), check_what_happened(Replies, 0, 0).check_what_happened([H | T], Aborts, Commits) -> case H of {ok, R} -> case filter_outcome(R) of committed -> check_what_happened(T, Aborts, Commits + 1); aborted ->
  90. check_what_happened(T, Aborts + 1, Commits); unclear -> check_what_happened(T, Aborts, Commits) end; {error, _} -> check_what_happened(T, Aborts, Commits); {badrpc, _} -> check_what_happened(T, Aborts, Commits) end;check_what_happened([], Aborts, Commits) -> if Aborts == 0, Commits == 0 -> aborted; % None of the active nodes knows Aborts > 0 -> aborted; % Someody has aborted Aborts == 0, Commits > 0 -> committed % All has committed end.首先询问本地 mnesia_recover,检查其是否保存着事务决议结果,若 mnesia_recover 有决议结果,则使用 mnesia_recover 的决议结果;否则询问其它参与节点决议结果。向其它存活的参与节点询问事务决议结果,可以看出,策略如下:若其他节点中有任何节点abort 了,则事务结果为 abort;若其他节点中没有节点 abort 或 commit,则事务结果为 abort;若其他节点中没有节点 abort 且至少有一个节点 commit,则事务结果为 commit。对于同时出现 abort 和 commit 的情况,mnesia 选择 abort,而在 commit 的参与节点处,由于 abort 的节点会询问其结果,commit 节点发现其与自己的事务结果冲突,会向上报告{inconsistent_database, bad_decision, Node}消息,这需要应用进行数据订正。reconfigure_participants 主要服务于参与者,若参与者得知,其请求者 down,则需要决议事务结果,并向其它节点广播自身发现的结果。参与者处第一、二阶段提交对 mnesia_down 的处理:由于此时没有提交,可以直接令事务中止。参与者处第三阶段提交对 mnesia_down 的处理,且没有任何参与节点开始提交:
  91. 由于此时没有提交,可以直接令事务中止。参与者处第三阶段提交对 mnesia_down 的处理,且有某些参与节点已经提交:由于此时有部分提交,则直接从提交节点处得到事务结果,本节点进行提交。5. mnesia_locker 处理 mnesia_down 消息mnesia_locker.erlmnesia_down(N, Pending) -> case whereis(?MODULE) of undefined -> mnesia_monitor:mnesia_down(?MODULE, N); Pid -> Pid ! {release_remote_non_pending, N, Pending} end.loop(State) -> receive … {release_remote_non_pending, Node, Pending} -> release_remote_non_pending(Node, Pending), mnesia_monitor:mnesia_down(?MODULE, Node), loop(State); … end.release_remote_non_pending(Node, Pending) -> ?ets_match_delete(mnesia_sticky_locks, {_ , Node}), AllTids = ?ets_match(mnesia_tid_locks, {$1, _, _}), Tids = [T || [T] <- AllTids, Node == node(T#tid.pid), not lists:member(T, Pending)], do_release_tids(Tids).do_release_tids([Tid | Tids]) -> do_release_tid(Tid), do_release_tids(Tids);do_release_tids([]) -> ok.do_release_tid(Tid) -> Locks = ?ets_lookup(mnesia_tid_locks, Tid), ?dbg("Release ~p ~p ~n", [Tid, Locks]), ?ets_delete(mnesia_tid_locks, Tid), release_locks(Locks), UniqueLocks = keyunique(lists:sort(Locks),[]),
  92. rearrange_queue(UniqueLocks).release_locks([Lock | Locks]) -> release_lock(Lock), release_locks(Locks);release_locks([]) -> ok.release_lock({Tid, Oid, {queued, _}}) -> ?ets_match_delete(mnesia_lock_queue, #queue{oid=Oid, tid = Tid, op = _, pid = _, lucky = _});release_lock({_Tid, Oid, write}) -> ?ets_delete(mnesia_held_locks, Oid);release_lock({Tid, Oid, read}) -> case ?ets_lookup(mnesia_held_locks, Oid) of [{Oid, Prev, Locks0}] -> case remove_tid(Locks0, Tid, []) of [] -> ?ets_delete(mnesia_held_locks, Oid); Locks -> ?ets_insert(mnesia_held_locks, {Oid, Prev, Locks}) end; [] -> ok end.mnesia_locker 所 作 的 工 作 就 比 较 直 接 , 即 为 清 除 四 张 锁 表 中 ( mnesia_lock_queue 、mnesia_held_locks,mnesia_sticky_locks,mnesia_tid_locks),与退出节点相关的锁。4. 节点 up 异步检测当退出的节点重新加入时,mnesia 作进行网络分区检查:mnesia_monitor.erlhandle_info({nodeup, Node}, State) -> HasDown = mnesia_recover:has_mnesia_down(Node), ImRunning = mnesia_lib:is_running(), if %% If Im not running the test will be made later. HasDown == true, ImRunning == yes -> spawn_link(?MODULE, detect_partitioned_network, [self(), Node]); true -> ignore end, {noreply, State};
  93. 网络分区和不一致的检查过程,将延迟到有新节点加入集群时,若该新节点曾经被本节点认为是 mnesia_down 的,则进行真正的检查过程。mnesia_recover.erlhas_mnesia_down(Node) -> case ?ets_lookup(mnesia_decision, Node) of [{mnesia_down, Node, _Date, _Time}] -> true; [] -> false end.从 ets 表 mnesia_decision 中取回节点的历史记录。mnesia_monitor.erldetect_partitioned_network(Mon, Node) -> detect_inconcistency([Node], running_partitioned_network), unlink(Mon), exit(normal).detect_inconcistency([], _Context) -> ok;detect_inconcistency(Nodes, Context) -> Downs = [N || N <- Nodes, mnesia_recover:has_mnesia_down(N)], {Replies, _BadNodes} = rpc:multicall(Downs, ?MODULE, has_remote_mnesia_down, [node()]), report_inconsistency(Replies, Context, ok).has_remote_mnesia_down(Node) -> HasDown = mnesia_recover:has_mnesia_down(Node), Master = mnesia_recover:get_master_nodes(schema), if HasDown == true, Master == [] -> {true, node()}; true -> {false, node()} end.本节点的检查过程,需要首先向新节点询问,在新节点的拓扑视图中,是否本节点是否也曾经出现过 down 的情况。新节点也将检查自己的历史记录,查看是否本节点曾经 down 过,并返回检查结果。注意,若配置了 master 节点选项,则可以通过 master 节点进行仲裁,可以不被认为 down 过。本节点收到结果后,进行实际的分区检查:
  94. report_inconsistency([{true, Node} | Replies], Context, _Status) -> Msg = {inconsistent_database, Context, Node}, mnesia_lib:report_system_event(Msg), report_inconsistency(Replies, Context, inconsistent_database);…若新加入节点认为本节点曾经 down 过,而此时本节点也认为新节点也 down 过,此时 mnesia存在潜在的不一致状态,此时必须通知应用,报告这个不一致消息,此时 Context 为running_partitioned_network,这也意味着 mnesia 是在运行过程中发现的网络分区。除了运行时通过 nodeup 消息对分区进行检查,还需要启动时对分区进行检查,否则在网络分区出现后,本节点关闭再重启,mnesia_down 临时历史记录消失(在日志中仍然记录),无法通过 nodeup 消息进行分区检查。mnesia_recover.erlconnect_nodes(Ns) -> call({connect_nodes, Ns}).handle_call({connect_nodes, Ns}, From, State) -> AlreadyConnected = val(recover_nodes), {_, Nodes} = mnesia_lib:search_delete(node(), Ns), Check = Nodes -- AlreadyConnected, case mnesia_monitor:negotiate_protocol(Check) of busy -> erlang:send_after(2, self(), {connect_nodes,Ns,From}), {noreply, State}; [] -> gen_server:reply(From, {[], AlreadyConnected}), {noreply, State}; GoodNodes -> mnesia_lib:add_list(recover_nodes, GoodNodes), cast({announce_all, GoodNodes}), case get_master_nodes(schema) of [] -> Context = starting_partitioned_network, mnesia_monitor:detect_inconcistency(GoodNodes, Context); _ -> %% If master_nodes is set ignore old inconsistencies ignore end, gen_server:reply(From, {GoodNodes, AlreadyConnected}), {noreply,State}
    end;

During startup the node must connect to the other nodes and negotiate the protocol version; only if the versions are compatible can it interact with them afterwards. If no master nodes are configured, a partition check is performed, again via mnesia_monitor:detect_inconcistency: each remote node is asked whether its mnesia_down history records this node, and if so there is a potential inconsistency. The context in this case is starting_partitioned_network, meaning the partition was discovered while mnesia was starting.

8. Others
