An overview of the mnesia split-brain problem


Contents

1. Phenomenon and causes
2. How mnesia works
3. Common problems and caveats
4. Source code analysis
   1. How mnesia:create_schema/1 works
      1. Overall flow
      2. First half: the work done by mnesia:create_schema/1
      3. Second half: the work done by mnesia:start/0
   4. How mnesia:change_table_majority/2 works
      1. Call interface
      2. Transaction operations
      3. Schema transaction commit interface
      4. Schema transaction protocol
      5. Remote node: transaction manager's response to the first-phase prepare
      6. Remote node: transaction participant's response to the second-phase precommit
      7. Requesting node: coordinator receives the second-phase precommit acknowledgements
      8. Remote node: transaction participant's response to the third-phase commit
      9. Local commit work in the third (commit) phase
   5. Handling of majority transactions
   6. Recovery
      1. Protocol version check + decision announcement and merge
      2. Node discovery, cluster traversal
      3. Schema merge
      4. Data merge, part 1: loading tables from remote nodes
      5. Data merge, part 2: loading tables from local disk
      6. Data merge, part 3: table loading completed
   7. Partition detection
      1. Synchronous detection while locking
      2. Synchronous detection during transactions
      3. Asynchronous detection of node down
      4. Asynchronous detection of node up
   8. Miscellaneous

The code analysed below is from Erlang/OTP R15B03.

1. Phenomenon and causes

Phenomenon: after a network partition occurs, different data can be written into each partition, so the partitions end up in mutually inconsistent states, and mnesia remains inconsistent after the partition heals. If any one partition is then restarted, the restarted partition pulls its data from the surviving partition and its own earlier data is lost.

Cause: distributed systems are constrained by the CAP theorem (a system that tolerates network partitions cannot provide both availability and consistency at the same time), and some distributed stores give up strong consistency in favour of eventual consistency in order to stay available. mnesia is such an eventually consistent distributed database: while there is no partition it is strongly consistent, but once a partition appears it still accepts writes, so the partitions diverge. After the partition is gone, the application has to resolve the inconsistency. A simple recovery is to restart the abandoned partition so that it re-fetches its data from the retained partition; a more elaborate recovery requires writing a data reconciliation program and applying it.

2. How mnesia works

[State diagram of mnesia's runtime behaviour omitted; the transactions in the diagram are majority transactions, i.e. writes are allowed only while a majority of the replica nodes are present in the cluster.]
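As a concrete illustration of enabling that behaviour, here is a minimal sketch of creating a table with the majority option; the table name, record definition and node list are hypothetical and not taken from the original deck:

%% Sketch: a replicated table whose write transactions require a majority
%% of its replica nodes to be in the cluster. All names are made up.
-record(account, {id, balance}).

create_account_table(Nodes) ->
    {atomic, ok} =
        mnesia:create_table(account,
                            [{attributes, record_info(fields, account)},
                             {disc_copies, Nodes},
                             {majority, true}]),
    ok.

Writes wrapped in mnesia:transaction/1 will then return {aborted, {no_majority, account}} while the node sits in a minority partition (the no_majority abort reason appears in the multi_commit code quoted later).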
How mnesia works, explained:

1. Transactions: runtime transactions guarantee strong consistency while there is no partition. mnesia supports several kinds of access:
   a) dirty writes, with no lock and no transaction: one asynchronous phase;
   b) locked asynchronous transactions: one synchronous lock phase, then a commit with one synchronous and one asynchronous phase;
   c) locked synchronous transactions: one synchronous lock phase, then a two-phase synchronous commit;
   d) locked majority transactions: one synchronous lock phase, then two synchronous commit phases plus one asynchronous phase;
   e) locked schema transactions: one synchronous lock phase, then a three-phase synchronous commit; in effect a majority transaction that also carries schema operations.
2. Recovery: recovery at restart provides eventual consistency once a partition has occurred. When mnesia restarts it performs the following distributed negotiation:
   a) node discovery;
   b) protocol version negotiation;
   c) schema merge;
   d) merge of transaction decisions:
      i. if the remote node's decision is abort and the local decision is commit, there is a conflict: {inconsistent_database, bad_decision, Node} is reported and the local decision is changed to abort;
      ii. if the remote decision is commit and the local decision is abort, the local decision stays abort; the remote node will change its decision and report the conflict;
      iii. if the remote decision is unclear and the local one is not, the local decision becomes the outcome and the remote node adjusts;
      iv. if both the remote and the local decision are unclear, the node waits until some node that knows the outcome starts up and adopts its result;
      v. if every node's decision is unclear, the outcome stays unclear;
      vi. the merged decisions do not by themselves change the actual table data;
   e) merge of table data (rule v below has a practical consequence for application startup; see the wait_for_tables sketch after this list):
      i. if the local node is a master node for a table, the table is loaded from local disk;
      ii. if the local node holds local_content tables, those tables are loaded from local disk;
      iii. if a remote replica node is alive, the table is copied from that remote node;
      iv. if no remote replica is alive and the local node was the last node to shut down, the table is loaded from local disk;
      v. if no remote replica is alive and the local node was not the last one to shut down, it waits until some other remote node has started and loaded the table and then copies it from there; until a remote node has loaded the table, the table is not accessible;
      vi. once a table has been loaded, it is not copied from a remote node again;
      vii. from the cluster's point of view:
         1. if another node restarts and initiates a new distributed negotiation, the local node adds it to its cluster topology view;
         2. if a node in the cluster goes down (shut down or partitioned away), the local node removes it from the topology view;
         3. when a partition heals, no distributed negotiation takes place, so the nodes of the other partitions cannot rejoin the topology view and the partitions remain separate;
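A minimal sketch of what rule v typically means for application startup code; the table list and timeout are hypothetical:

%% Sketch: block until the listed tables have been loaded, either from
%% local disk or from a remote replica, or give up after Timeout ms.
ensure_tables_loaded() ->
    Tabs = [account],        %% hypothetical table list
    Timeout = 30000,
    case mnesia:wait_for_tables(Tabs, Timeout) of
        ok ->
            ok;
        {timeout, BadTabs} ->
            %% No replica has loaded these tables yet; either keep waiting
            %% or force-load them (at the risk of reading stale data).
            {error, {still_waiting_for, BadTabs}};
        {error, Reason} ->
            {error, Reason}
    end.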
3. Inconsistency detection: by monitoring the up/down state of remote nodes and their transaction decisions, both at runtime and at restart, mnesia detects whether a network partition has ever occurred. A past partition implies a potential inconsistency between the partitions, and the application is notified through an inconsistent_database event (a subscription sketch follows this list):
   a) at runtime the up/down history of remote nodes is tracked; if both sides believe the other has been down, then when the remote node comes up again {inconsistent_database, running_partitioned_network, Node} is reported;
   b) at restart the up/down history is checked in the same way; when the remote node comes up again {inconsistent_database, starting_partitioned_network, Node} is reported;
   c) at runtime and at restart transaction decisions are exchanged with remote nodes; if the remote node aborted a transaction that the local node committed, {inconsistent_database, bad_decision, Node} is reported.
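A minimal sketch of receiving these events in an application process; the process structure and logging are illustrative only:

%% Sketch: subscribe to mnesia system events and watch for partition
%% reports. mnesia must already be running when this is called.
watch_for_partitions() ->
    {ok, _Node} = mnesia:subscribe(system),
    partition_watch_loop().

partition_watch_loop() ->
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            %% Context is running_partitioned_network,
            %% starting_partitioned_network or bad_decision.
            error_logger:error_msg("Possible split brain (~p) with ~p~n",
                                   [Context, Node]),
            partition_watch_loop();
        {mnesia_system_event, _Other} ->
            partition_watch_loop()
    end.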
3. Common problems and caveats

Wherever the problems below involve transactions, only majority transactions are discussed: they are somewhat more complete than synchronous and asynchronous transactions, and schema operations are not involved.

fail_safe state: the state in which, after a network partition, the minority partition cannot be written to.

Common problems:

1. After a partition the local node is in the minority partition. If the partition later heals and the local node is not restarted, does it keep the fail_safe state forever?
   If other nodes keep starting up, negotiating with the local node and joining its cluster so that the cluster becomes a majority, the cluster becomes writable again;
   if no other node ever starts, the local node stays in the fail_safe state.
2. After a brief network interruption, a write is made in the majority partition; this write cannot reach the minority partition. After the partition heals, a write is attempted in the minority partition. How does the minority end up in the fail_safe state?
   mnesia relies on the Erlang VM to detect that nodes are down. During the write in the majority partition, the VM there detects that the minority nodes are down, and the VM in the minority likewise detects that the majority nodes are down. Since both sides have seen the other side go down, the majority remains writable while the minority enters the fail_safe state.
3. For a cluster A, B, C, a partition separates A from B and C; data is written on B and C; the partition then heals. What happens if A is restarted? What happens if B and C are restarted?
   Experiments show:
   a) if A is restarted, the records written on B and C are correctly visible on A; this follows from A's negotiation at startup, in which A requests table data from B and C;
   b) if B and C are restarted, the records written earlier can no longer be found on B and C; this follows from B's and C's negotiation at startup, in which they request table data from A.

Caveats:

1. When a partition occurs, mnesia is eventually consistent, not strongly consistent. Strong consistency can be obtained by designating a master node that arbitrates the final data, but this introduces a single point of failure (see the recovery sketch after this list).
2. By the time the application subscribes to mnesia's system events (including inconsistent_database events), mnesia has already started, so some events may already have been emitted and can be missed.
3. Subscriptions to mnesia events are not persistent; after mnesia restarts they must be re-established.
4. A majority transaction has two synchronous phases and one asynchronous phase; during commit every participating node must also spawn an extra process to handle the transaction, on top of the usual ets store and the synchronous lock phase, which may further reduce performance.
5. majority transactions place no constraint on the recovery process, and recovery prefers the tables held by live remote nodes as the basis for rebuilding the local copies.
6. mnesia's check for and reporting of inconsistent_database uses a rather strong condition and may produce false positives.
7. When a split brain is discovered after the fact, it is best to raise an alert and let an operator resolve it, rather than resolving it automatically.
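A minimal sketch of the "simple recovery" from section 1 and of caveat 1: declare the surviving partition's nodes as masters on the node whose data is to be discarded, then restart mnesia there so that its tables are reloaded from the surviving partition. The function name and error handling are illustrative, and in line with caveat 7 this should be triggered by an operator, not run automatically:

%% Sketch: run on the node whose local data is to be sacrificed.
%% SurvivingNodes is the partition chosen as authoritative.
recover_from_split_brain(SurvivingNodes) ->
    %% Tables will be loaded from these nodes at the next start.
    ok = mnesia:set_master_nodes(SurvivingNodes),
    stopped = mnesia:stop(),
    ok = mnesia:start(),
    %% Clear the master setting so the normal loading rules apply again.
    ok = mnesia:set_master_nodes([]).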
  8. 8. 4. 源码分析主题包括:1. mnesia 磁盘表副本要求 schema 也有磁盘表副本,因此需要参考 mnesia:create_schema/1 的工作过程;2. 此处使用 majority 事务进行解释, 必须参考 mnesia:change_table_majority/2 的工作过程, 且此过程是 schema 事务,可以更详细全面的理解 majority 事务;3. majority 事务处理将弱化 schema 事务模型,进行特定的解释;4. 恢复过程分析 mnesia 启动时的主要工作、分布式协商过程、磁盘表加载;5. 分区检查分析 mnesia 如何检查各类 inconsistent_database 事件;1. mnesia:create_schema/1 的工作过程1. 主体过程安装 schema 的过程必须要在 mnesia 停机的条件下进行,此后,mnesia 启动。schema 添加的过程本质上是一个两阶段提交过程:schema 变更发起节点1. 询问各个参与节点是否已经由 schema 副本2. 上全局锁{mnesia_table_lock, schema}3. 在各个参与节点上建立 mnesia_fallback 进程4. 第一阶段向各个节点的 mnesia_fallback 进程广播{start, Header, Schema2}消息,通知其保 存新生成的 schema 文件备份
  9. 9. 5. 第二阶段向各个节点的 mnesia_fallback 进程广播 swap 消息,通知其完成提交过程,创 建真正的"FALLBACK.BUP"文件6. 最终向各个节点的 mnesia_fallback 进程广播 stop 消息,完成变更2. 前半部分 mnesia:create_schema/1 做的工作mnesia.erlcreate_schema(Ns) -> mnesia_bup:create_schema(Ns).mnesia_bup.erlcreate_schema([]) -> create_schema([node()]);create_schema(Ns) when is_list(Ns) -> case is_set(Ns) of true -> create_schema(Ns, mnesia_schema:ensure_no_schema(Ns)); false -> {error, {combine_error, Ns}} end;create_schema(Ns) -> {error, {badarg, Ns}}.mnesia_schema.erlensure_no_schema([H|T]) when is_atom(H) -> case rpc:call(H, ?MODULE, remote_read_schema, []) of {badrpc, Reason} -> {H, {"All nodes not running", H, Reason}}; {ok,Source, _} when Source /= default -> {H, {already_exists, H}}; _ -> ensure_no_schema(T) end;ensure_no_schema([H|_]) -> {error,{badarg, H}};ensure_no_schema([]) -> ok.remote_read_schema() -> case mnesia_lib:ensure_loaded(?APPLICATION) of ok ->
  10. 10. case mnesia_monitor:get_env(schema_location) of opt_disc -> read_schema(false); _ -> read_schema(false) end; {error, Reason} -> {error, Reason} end.询问其它所有节点,检查其是否启动,并检查其是否已经具备了 mnesia 的 schema,仅当所有预备建立 mnesia schema 的节点全部启动,且没有 schema 副本,该检查才成立。回到 mnesia_bup.erlmnesia_bup.erlcreate_schema(Ns, ok) -> case mnesia_lib:ensure_loaded(?APPLICATION) of ok -> case mnesia_monitor:get_env(schema_location) of ram -> {error, {has_no_disc, node()}}; _ -> case mnesia_schema:opt_create_dir(true, mnesia_lib:dir()) of {error, What} -> {error, What}; ok -> Mod = mnesia_backup, Str = mk_str(), File = mnesia_lib:dir(Str), file:delete(File), case catch make_initial_backup(Ns, File, Mod) of {ok, _Res} -> case do_install_fallback(File, Mod) of ok -> file:delete(File), ok; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end
  11. 11. end end; {error, Reason} -> {error, Reason} end;create_schema(_Ns, {error, Reason}) -> {error, Reason};create_schema(_Ns, Reason) -> {error, Reason}.通过 mnesia_bup:make_initial_backup 创建一个本地节点的新 schema 的描述文件,然后再通过 mnesia_bup:do_install_fallback 将新 schema 描述文件通过恢复过程,变更 schema:make_initial_backup(Ns, Opaque, Mod) -> Orig = mnesia_schema:get_initial_schema(disc_copies, Ns), Modded = proplists:delete(storage_properties, proplists:delete(majority, Orig)), Schema = [{schema, schema, Modded}], O2 = do_apply(Mod, open_write, [Opaque], Opaque), O3 = do_apply(Mod, write, [O2, [mnesia_log:backup_log_header()]], O2), O4 = do_apply(Mod, write, [O3, Schema], O3), O5 = do_apply(Mod, commit_write, [O4], O4), {ok, O5}.创建一个本地节点的新 schema 的描述文件,注意,新 schema 的 majority 属性没有在备份中。mnesia_schema.erlget_initial_schema(SchemaStorage, Nodes) -> Cs = #cstruct{name = schema, record_name = schema, attributes = [table, cstruct]}, Cs2 = case SchemaStorage of ram_copies -> Cs#cstruct{ram_copies = Nodes}; disc_copies -> Cs#cstruct{disc_copies = Nodes} end, cs2list(Cs2).mnesia_bup.erldo_install_fallback(Opaque, Mod) when is_atom(Mod) -> do_install_fallback(Opaque, [{module, Mod}]);do_install_fallback(Opaque, Args) when is_list(Args) -> case check_fallback_args(Args, #fallback_args{opaque = Opaque}) of
  12. 12. {ok, FA} -> do_install_fallback(FA); {error, Reason} -> {error, Reason} end;do_install_fallback(_Opaque, Args) -> {error, {badarg, Args}}.检 查 安 装 参 数 , 将 参 数 装 入 一 个 fallback_args 结 构 , 参 数 检 查 及 构 造 过 程 在check_fallback_arg_type/2 中,然后进行安装check_fallback_args([Arg | Tail], FA) -> case catch check_fallback_arg_type(Arg, FA) of {EXIT, _Reason} -> {error, {badarg, Arg}}; FA2 -> check_fallback_args(Tail, FA2) end;check_fallback_args([], FA) -> {ok, FA}.check_fallback_arg_type(Arg, FA) -> case Arg of {scope, global} -> FA#fallback_args{scope = global}; {scope, local} -> FA#fallback_args{scope = local}; {module, Mod} -> Mod2 = mnesia_monitor:do_check_type(backup_module, Mod), FA#fallback_args{module = Mod2}; {mnesia_dir, Dir} -> FA#fallback_args{mnesia_dir = Dir, use_default_dir = false}; {keep_tables, Tabs} -> atom_list(Tabs), FA#fallback_args{keep_tables = Tabs}; {skip_tables, Tabs} -> atom_list(Tabs), FA#fallback_args{skip_tables = Tabs}; {default_op, keep_tables} -> FA#fallback_args{default_op = keep_tables}; {default_op, skip_tables} -> FA#fallback_args{default_op = skip_tables} end.
  13. 13. 此处的构造过程记录 module 参数, mnesia_backup, 为 同时记录 opaque 参数, 为新建 schema文件的文件名。do_install_fallback(FA) -> Pid = spawn_link(?MODULE, install_fallback_master, [self(), FA]), Res = receive {EXIT, Pid, Reason} -> % if appl has trapped exit {error, {EXIT, Reason}}; {Pid, Res2} -> case Res2 of {ok, _} -> ok; {error, Reason} -> {error, {"Cannot install fallback", Reason}} end end, Res.install_fallback_master(ClientPid, FA) -> process_flag(trap_exit, true), State = {start, FA}, Opaque = FA#fallback_args.opaque, Mod = FA#fallback_args.module, Res = (catch iterate(Mod, fun restore_recs/4, Opaque, State)), unlink(ClientPid), ClientPid ! {self(), Res}, exit(shutdown).从新建的 schema 文件中迭代恢复到本地节点和全局集群,此时 Mod 为 mnesia_backup,Opaque 为新建 schema 文件的文件名,State 为给出的 fallback_args 参数,均为默认值。fallback_args 默认定义:-record(fallback_args, {opaque, scope = global, module = mnesia_monitor:get_env(backup_module), use_default_dir = true, mnesia_dir, fallback_bup, fallback_tmp, skip_tables = [],
  14. 14. keep_tables = [], default_op = keep_tables }).iterate(Mod, Fun, Opaque, Acc) -> R = #restore{bup_module = Mod, bup_data = Opaque}, case catch read_schema_section(R) of {error, Reason} -> {error, Reason}; {R2, {Header, Schema, Rest}} -> case catch iter(R2, Header, Schema, Fun, Acc, Rest) of {ok, R3, Res} -> catch safe_apply(R3, close_read, [R3#restore.bup_data]), {ok, Res}; {error, Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, Reason}; {EXIT, Pid, Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, {EXIT, Pid, Reason}}; {EXIT, Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, {EXIT, Reason}} end end.iter(R, Header, Schema, Fun, Acc, []) -> case safe_apply(R, read, [R#restore.bup_data]) of {R2, []} -> Res = Fun([], Header, Schema, Acc), {ok, R2, Res}; {R2, BupItems} -> iter(R2, Header, Schema, Fun, Acc, BupItems) end;iter(R, Header, Schema, Fun, Acc, BupItems) -> Acc2 = Fun(BupItems, Header, Schema, Acc), iter(R, Header, Schema, Fun, Acc2, []).read_schema_section 将读出新建 schema 文件的内容,得到文件头部,并组装出 schema 结构,将 schema 应用回调函数,此处回调函数为 mnesia_bup 的 restore_recs/4 函数:restore_recs(Recs, Header, Schema, {start, FA}) -> %% No records in backup Schema2 = convert_schema(Header#log_header.log_version, Schema), CreateList = lookup_schema(schema, Schema2),
  15. 15. case catch mnesia_schema:list2cs(CreateList) of {EXIT, Reason} -> throw({error, {"Bad schema in restore_recs", Reason}}); Cs -> Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies), global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity), Args = [self(), FA], Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns], send_fallback(Pids, {start, Header, Schema2}), Res = restore_recs(Recs, Header, Schema2, Pids), global:del_lock({{mnesia_table_lock, schema}, self()}, Ns), Res end;一个典型的 schema 结构如下:[{schema,schema, [{name,schema}, {type,set}, {ram_copies,[]}, {disc_copies,[rds_la_dev@10.232.64.77]}, {disc_only_copies,[]}, {load_order,0}, {access_mode,read_write}, {index,[]}, {snmp,[]}, {local_content,false}, {record_name,schema}, {attributes,[table,cstruct]}, {user_properties,[]}, {frag_properties,[]}, {cookie,{{1358,676768,107058},rds_la_dev@10.232.64.77}}, {version,{{2,0},[]}}]}]构成一个{schema, schema, CreateList}的元组,同时调用 mnesia_schema:list2cs(CreateList),将CreateList 还原回 schema 的 cstruct 结构。mnesia_bup.erlrestore_recs(Recs, Header, Schema, {start, FA}) -> %% No records in backup Schema2 = convert_schema(Header#log_header.log_version, Schema), CreateList = lookup_schema(schema, Schema2), case catch mnesia_schema:list2cs(CreateList) of {EXIT, Reason} -> throw({error, {"Bad schema in restore_recs", Reason}});
  16. 16. Cs -> Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies), global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity), Args = [self(), FA], Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns], send_fallback(Pids, {start, Header, Schema2}), Res = restore_recs(Recs, Header, Schema2, Pids), global:del_lock({{mnesia_table_lock, schema}, self()}, Ns), Res end;get_fallback_nodes 将得到参与 schema 构建节点的 fallback 节点,通常为所有参与 schema构建的节点。构建过程要加入集群的全局锁{mnesia_table_lock, schema}。在各个参与 schema 构建的节点上,均创建一个 fallback_receiver 进程, 处理 schema 的变更。向这些节点的 fallback_receiver 进程广播{start, Header, Schema2}消息,并等待其返回结果。所有节点的 fallback_receiver 进程对 start 消息响应后,进入下一个过程:restore_recs([], _Header, _Schema, Pids) -> send_fallback(Pids, swap), send_fallback(Pids, stop), stop;restore_recs 向所有节点的 fallback_receiver 进程广播后续的 swap 消息和 stop 消息,完成整个 schema 变更过程,然后释放全局锁{mnesia_table_lock, schema}。进入 fallback_receiver 进程的处理过程:fallback_receiver(Master, FA) -> process_flag(trap_exit, true), case catch register(mnesia_fallback, self()) of {EXIT, _} -> Reason = {already_exists, node()}, local_fallback_error(Master, Reason); true -> FA2 = check_fallback_dir(Master, FA), Bup = FA2#fallback_args.fallback_bup, case mnesia_lib:exists(Bup) of
  17. 17. true -> Reason2 = {already_exists, node()}, local_fallback_error(Master, Reason2); false -> Mod = mnesia_backup, Tmp = FA2#fallback_args.fallback_tmp, R = #restore{mode = replace, bup_module = Mod, bup_data = Tmp}, file:delete(Tmp), case catch fallback_receiver_loop(Master, R, FA2, schema) of {error, Reason} -> local_fallback_error(Master, Reason); Other -> exit(Other) end end end.在自身的节点上注册进程名字为 mnesia_fallback。构建初始化状态。进入 fallback_receiver_loop 循环处理来自 schema 变更发起节点的消息。fallback_receiver_loop(Master, R, FA, State) -> receive {Master, {start, Header, Schema}} when State =:= schema -> Dir = FA#fallback_args.mnesia_dir, throw_bad_res(ok, mnesia_schema:opt_create_dir(true, Dir)), R2 = safe_apply(R, open_write, [R#restore.bup_data]), R3 = safe_apply(R2, write, [R2#restore.bup_data, [Header]]), BupSchema = [schema2bup(S) || S <- Schema], R4 = safe_apply(R3, write, [R3#restore.bup_data, BupSchema]), Master ! {self(), ok}, fallback_receiver_loop(Master, R4, FA, records); … end.在本地也创建一个 schema 临时文件, 接收来自变更发起节点构建的 header 部分和新 schema。fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, swap} when State =/= schema -> ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []),
  18. 18. safe_apply(R, commit_write, [R#restore.bup_data]), Bup = FA#fallback_args.fallback_bup, Tmp = FA#fallback_args.fallback_tmp, throw_bad_res(ok, file:rename(Tmp, Bup)), catch mnesia_lib:set(active_fallback, true), ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []), Master ! {self(), ok}, fallback_receiver_loop(Master, R, FA, stop); … end.mnesia_backup.erlcommit_write(OpaqueData) -> B = OpaqueData, case disk_log:sync(B#backup.file_desc) of ok -> case disk_log:close(B#backup.file_desc) of ok -> case file:rename(B#backup.tmp_file, B#backup.file) of ok -> {ok, B#backup.file}; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end.变更提交过程,新建的 schema 文件在写入到本节点时,为文件名后跟".BUPTMP"表明是一个临时未提交的文件,此处进行提交时,sync 新建的 schema 文件到磁盘后关闭,并重命名为真正的新建的 schema 文件名,消除最后的".BUPTMP"fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, swap} when State =/= schema -> ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []), safe_apply(R, commit_write, [R#restore.bup_data]), Bup = FA#fallback_args.fallback_bup,
  19. 19. Tmp = FA#fallback_args.fallback_tmp, throw_bad_res(ok, file:rename(Tmp, Bup)), catch mnesia_lib:set(active_fallback, true), ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []), Master ! {self(), ok}, fallback_receiver_loop(Master, R, FA, stop); …end.在这个参与节点上,将新建 schema 文件命名为"FALLBACK.BUP",同时激活本地节点的active_fallback 属性,表明称为一个活动 fallback 节点。fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, stop} when State =:= stop -> stopped; … end.收到 stop 消息后,mnesia_fallback 进程退出。3. 后半部分 mnesia:start/0 做的工作mnesia 启 动 , 则 可 以 自 动 通 过 事 务 管 理 器 mnesia_tm 调 用mnesia_bup:tm_fallback_start(IgnoreFallback)将 schema 建立到 dets 表中:mnesia_bup.erltm_fallback_start(IgnoreFallback) -> mnesia_schema:lock_schema(), Res = do_fallback_start(fallback_exists(), IgnoreFallback), mnesia_schema: unlock_schema(), case Res of ok -> ok; {error, Reason} -> exit(Reason) end.锁住 schema 表,然后通过"FALLBACK.BUP"文件进行 schema 恢复创建,最后释放 schema 表锁
  20. 20. do_fallback_start(true, false) -> verbose("Starting from fallback...~n", []), BupFile = fallback_bup(), Mod = mnesia_backup, LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]), case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of … end.根据"FALLBACK.BUP"文件,调用 restore_tables 函数进行恢复restore_tables(Recs, Header, Schema, {start, LocalTabs}) -> Dir = mnesia_lib:dir(), OldDir = filename:join([Dir, "OLD_DIR"]), mnesia_schema:purge_dir(OldDir, []), mnesia_schema:purge_dir(Dir, [fallback_name()]), init_dat_files(Schema, LocalTabs), State = {new, LocalTabs}, restore_tables(Recs, Header, Schema, State);init_dat_files(Schema, LocalTabs) -> TmpFile = mnesia_lib:tab2tmp(schema), Args = [{file, TmpFile}, {keypos, 2}, {type, set}], case dets:open_file(schema, Args) of % Assume schema lock {ok, _} -> create_dat_files(Schema, LocalTabs), ok = dets:close(schema), LocalTab = #local_tab{ name = schema, storage_type = disc_copies, open = undefined, add = undefined, close = undefined, swap = undefined, record_name = schema, opened = false}, ?ets_insert(LocalTabs, LocalTab); {error, Reason} -> throw({error, {"Cannot open file", schema, Args, Reason}}) end.创建 schema 的 dets 表,文件名为 schema.TMP,根据"FALLBACK.BUP"文件,将各个表的元数据恢复到新建的 schema 的 dets 表中。
  21. 21. 调用 create_dat_files 构建其它表在本节点的元数据信息的 Open/Add/Close/Swap 函数,然后调用之,将其它表的元数据持久化到 schema 表中。restore_tables(Recs, Header, Schema, {start, LocalTabs}) -> Dir = mnesia_lib:dir(), OldDir = filename:join([Dir, "OLD_DIR"]), mnesia_schema:purge_dir(OldDir, []), mnesia_schema:purge_dir(Dir, [fallback_name()]), init_dat_files(Schema, LocalTabs), State = {new, LocalTabs}, restore_tables(Recs, Header, Schema, State);构建其它表在本节点的元数据信息的 Open/Add/Close/Swap 函数restore_tables(All=[Rec | Recs], Header, Schema, {new, LocalTabs}) -> Tab = element(1, Rec), case ?ets_lookup(LocalTabs, Tab) of [] -> State = {not_local, LocalTabs, Tab}, restore_tables(Recs, Header, Schema, State); [LT] when is_record(LT, local_tab) -> State = {local, LocalTabs, LT}, case LT#local_tab.opened of true -> ignore; false -> (LT#local_tab.open)(Tab, LT), ?ets_insert(LocalTabs,LT#local_tab{opened=true}) end, restore_tables(All, Header, Schema, State) end;打开表,不断检查表是否位于本地,若是则进行恢复添加过程:restore_tables(All=[Rec | Recs], Header, Schema, State={local, LocalTabs, LT}) -> Tab = element(1, Rec), if Tab =:= LT#local_tab.name -> Key = element(2, Rec), (LT#local_tab.add)(Tab, Key, Rec, LT), restore_tables(Recs, Header, Schema, State); true -> NewState = {new, LocalTabs}, restore_tables(All, Header, Schema, NewState) end;Add 函数主要为将表记录入 schema 表,此处是写入临时 schema,而未真正提交
  22. 22. 待所有表恢复完成后,进行真正的提交工作:do_fallback_start(true, false) -> verbose("Starting from fallback...~n", []), BupFile = fallback_bup(), Mod = mnesia_backup, LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]), case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of {ok, _Res} -> catch dets:close(schema), TmpSchema = mnesia_lib:tab2tmp(schema), DatSchema = mnesia_lib:tab2dat(schema), AllLT = ?ets_match_object(LocalTabs, _), ?ets_delete_table(LocalTabs), case file:rename(TmpSchema, DatSchema) of ok -> [(LT#local_tab.swap)(LT#local_tab.name, LT) || LT <- AllLT, LT#local_tab.name =/= schema], file:delete(BupFile), ok; {error, Reason} -> file:delete(TmpSchema), {error, {"Cannot start from fallback. Rename error.", Reason}} end; {error, Reason} -> {error, {"Cannot start from fallback", Reason}}; {EXIT, Reason} -> {error, {"Cannot start from fallback", Reason}} end.将 schema.TMP 变更为 schema.DAT,正式启用持久 schema,提交 schema 表的变更同 时 调 用在 create_dat_files 函 数 中创 建 的转 换 函数 , 进 行各 个表 的 提交工 作 , 对于ram_copies 表,没有什么额外动作,对于 disc_only_copies 表,主要为提交其对应 dets 表的文件名,对于 disc_copies 表,主要为记录 redo 日志,然后提交其对应 dets 表的文件名。全部完成后,schema 表将成为持久的 dets 表,"FALLBACK.BUP"文件也将被删除。事务管理器在完成 schema 的 dets 表的构建后,将初始化 mnesia_schema:mnesia_schema.erl
  23. 23. init(IgnoreFallback) -> Res = read_schema(true, IgnoreFallback), {ok, Source, _CreateList} = exit_on_error(Res), verbose("Schema initiated from: ~p~n", [Source]), set({schema, tables}, []), set({schema, local_tables}, []), Tabs = set_schema(?ets_first(schema)), lists:foreach(fun(Tab) -> clear_whereabouts(Tab) end, Tabs), set({schema, where_to_read}, node()), set({schema, load_node}, node()), set({schema, load_reason}, initial), mnesia_controller:add_active_replica(schema, node()).检查 schema 表从何处恢复,在 mnesia_gvar 这个全局状态 ets 表中,初始化 schema 的原始信息,并将本节点作为 schema 表的初始活动副本若某个节点作为一个表的活动副本,则表的 where_to_commit 和 where_to_write 属性必须同时包含该节点。4. mnesia:change_table_majority/2 的工作过程mnesia 表可以在建立时,设置一个 majority 的属性,也可以在建立表之后通过 mnesia:change_table_majority/2 更改此属性。该属性可以要求 mnesia 在进行事务时,检查所有参与事务的节点是否为表的提交节点的大多数,这样可以在出现网络分区时,保证 majority 节点的可用性,同时也能保证整个网络的一致性,minority 节点将不可用,这也是 CAP 理论的一个折中。1. 调用接口mnesia.erlchange_table_majority(T, M) -> mnesia_schema:change_table_majority(T, M).
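A usage sketch for this interface (the table name is hypothetical): the call must be made while mnesia is running, and it returns the result of the underlying schema transaction:

%% Sketch: require a majority of replica nodes for writes to Tab.
enable_majority(Tab) ->
    case mnesia:change_table_majority(Tab, true) of
        {atomic, ok} -> ok;
        {aborted, Reason} -> {error, Reason}
    end.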
  24. 24. mnesia_schema.erlchange_table_majority(Tab, Majority) when is_boolean(Majority) -> schema_transaction(fun() -> do_change_table_majority(Tab, Majority) end).schema_transaction(Fun) -> case get(mnesia_activity_state) of undefined -> Args = [self(), Fun, whereis(mnesia_controller)], Pid = spawn_link(?MODULE, schema_coordinator, Args), receive {transaction_done, Res, Pid} -> Res; {EXIT, Pid, R} -> {aborted, {transaction_crashed, R}} end; _ -> {aborted, nested_transaction} end.启动一个 schema 事务的协调者 schema_coordinator 进程。schema_coordinator(Client, Fun, Controller) when is_pid(Controller) -> link(Controller), unlink(Client), Res = mnesia:transaction(Fun), Client ! {transaction_done, Res, self()}, unlink(Controller), % Avoids spurious exit message unlink(whereis(mnesia_tm)), % Avoids spurious exit message exit(normal).与普通事务不同, schema 事务使用的 schema_coordinator 进程 link 到的不是请求者的进程,而是 mnesia_controller 进程。启动一个 mnesia 事务,函数为 fun() -> do_change_table_majority(Tab, Majority) end。2. 事务操作do_change_table_majority(schema, _Majority) -> mnesia:abort({bad_type, schema});do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).
  25. 25. 可以看出,不能修改 schema 表的 majority 属性。对 schema 表主动申请写锁,而不对需要更改 majority 属性的表申请锁get_tid_ts_and_lock(Tab, Intent) -> TidTs = get(mnesia_activity_state), case TidTs of {_Mod, Tid, Ts} when is_record(Ts, tidstore)-> Store = Ts#tidstore.store, case Intent of read -> mnesia_locker:rlock_table(Tid, Store, Tab); write -> mnesia_locker:wlock_table(Tid, Store, Tab); none -> ignore end, TidTs; _ -> mnesia:abort(no_transaction) end.上锁的过程:直接向锁管理器 mnesia_locker 请求表锁。do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).关注实际的 majority 属性的修改动作:make_change_table_majority(Tab, Majority) -> ensure_writable(schema), Cs = incr_version(val({Tab, cstruct})), ensure_active(Cs), OldMajority = Cs#cstruct.majority, Cs2 = Cs#cstruct{majority = Majority}, FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of {_, Tab} -> FragNames = mnesia_frag:frag_names(Tab) -- [Tab], lists:map( fun(T) -> get_tid_ts_and_lock(Tab, none), CsT = incr_version(val({T, cstruct})), ensure_active(CsT), CsT2 = CsT#cstruct{majority = Majority}, verify_cstruct(CsT2), {op, change_table_majority, vsn_cs2list(CsT2), OldMajority, Majority}
  26. 26. end, FragNames); false -> []; {_, _} -> mnesia:abort({bad_type, Tab}) end, verify_cstruct(Cs2), [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].通过 ensure_writable 检查 schema 表的 where_to_write 属性是否为[],即是否有持久化的schema 节点。通过 incr_version 更新表的版本号。通过 ensure_active 检查所有表的副本节点是否存活, 即与副本节点进行表的全局视图确认。修改表的元数据版本号:incr_version(Cs) -> {{Major, Minor}, _} = Cs#cstruct.version, Nodes = mnesia_lib:intersect(val({schema, disc_copies}), mnesia_lib:cs_to_nodes(Cs)), V= case Nodes -- val({Cs#cstruct.name, active_replicas}) of [] -> {Major + 1, 0}; % All replicas are active _ -> {Major, Minor + 1} % Some replicas are inactive end, Cs#cstruct{version = {V, {node(), now()}}}.mnesia_lib.erlcs_to_nodes(Cs) -> Cs#cstruct.disc_only_copies ++ Cs#cstruct.disc_copies ++ Cs#cstruct.ram_copies.重新计算表的元数据版本号,由于这是一个 schema 表的变更,需要参考有持久 schema 的节点以及持有该表副本的节点的信息而计算表的版本号,若这二类节点的交集全部存活,则主版本可以增加,否则仅能增加副版本,同时为表的 cstruct 结构生成一个新的版本描述符,这个版本描述符包括三个部分:{新的版本号,{发起变更的节点,发起变更的时间}},相当于时空序列+单调递增序列。版本号的计算类似于 NDB。检查表的全局视图:
  27. 27. ensure_active(Cs) -> ensure_active(Cs, active_replicas).ensure_active(Cs, What) -> Tab = Cs#cstruct.name, W = {Tab, What}, ensure_non_empty(W), Nodes = mnesia_lib:intersect(val({schema, disc_copies}), mnesia_lib:cs_to_nodes(Cs)), case Nodes -- val(W) of [] -> ok; Ns -> Expl = "All replicas on diskfull nodes are not active yet", case val({Tab, local_content}) of true -> case rpc:multicall(Ns, ?MODULE, is_remote_member, [W]) of {Replies, []} -> check_active(Replies, Expl, Tab); {_Replies, BadNs} -> mnesia:abort({not_active, Expl, Tab, BadNs}) end; false -> mnesia:abort({not_active, Expl, Tab, Ns}) end end.is_remote_member(Key) -> IsActive = lists:member(node(), val(Key)), {IsActive, node()}.为了防止不一致的状态,需要向这样的未明节点进行确认:该节点不是表的活动副本节点,却是表的副本节点,也是持久 schema 节点。确认的内容是:通过 is_remote_member 询问该节点,其是否已经是该表的活动副本节点。这样就避免了未明节点与请求节点对改变状态的不一致认知。make_change_table_majority(Tab, Majority) -> ensure_writable(schema), Cs = incr_version(val({Tab, cstruct})), ensure_active(Cs), OldMajority = Cs#cstruct.majority, Cs2 = Cs#cstruct{majority = Majority}, FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of {_, Tab} ->
  28. 28. FragNames = mnesia_frag:frag_names(Tab) -- [Tab], lists:map( fun(T) -> get_tid_ts_and_lock(Tab, none), CsT = incr_version(val({T, cstruct})), ensure_active(CsT), CsT2 = CsT#cstruct{majority = Majority}, verify_cstruct(CsT2), {op, change_table_majority, vsn_cs2list(CsT2), OldMajority, Majority} end, FragNames); false -> []; {_, _} -> mnesia:abort({bad_type, Tab}) end, verify_cstruct(Cs2), [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].变更表的 cstruct 中对 majority 属性的记录,检查表的新建 cstruct,主要检查 cstruct 的各项成员的类型,内容是否合乎要求。vsn_cs2list 将 cstruct 转换为一个 proplist,key 为 record 成员名,value 为 record 成员值。生成一个 change_table_majority 动作,供给 insert_schema_ops 使用。do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).此时 make_change_table_majority 生成的动作为[{op, change_taboe_majority, 表的新 cstruct组成的 proplist, OldMajority, Majority}]insert_schema_ops({_Mod, _Tid, Ts}, SchemaIOps) -> do_insert_schema_ops(Ts#tidstore.store, SchemaIOps).do_insert_schema_ops(Store, [Head | Tail]) -> ?ets_insert(Store, Head), do_insert_schema_ops(Store, Tail);do_insert_schema_ops(_Store, []) -> ok.可以看到, 插入过程仅仅将 make_change_table_majority 操作记入当前事务的临时 ets 表中。这个临时插入动作完成后,mnesia 将开始执行提交过程,与普通表事务不同,由于操作是
  29. 29. op 开头,表明这是一个 schema 事务,事务管理器需要额外的处理,使用不同的事务提交过程。3. schema 事务提交接口mnesia_tm.erlt_commit(Type) -> {_Mod, Tid, Ts} = get(mnesia_activity_state), Store = Ts#tidstore.store, if Ts#tidstore.level == 1 -> intercept_friends(Tid, Ts), case arrange(Tid, Store, Type) of {N, Prep} when N > 0 -> multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store); {0, Prep} -> multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store) end; true -> %% nested commit Level = Ts#tidstore.level, [{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores, req({del_store, Tid, Store, Obsolete, false}), NewTs = Ts#tidstore{store = Store, up_stores = Tail, level = Level - 1}, NewTidTs = {OldMod, Tid, NewTs}, put(mnesia_activity_state, NewTidTs), do_commit_nested end.首先在操作重排时进行检查:arrange(Tid, Store, Type) -> %% The local node is always included Nodes = get_elements(nodes,Store), Recs = prep_recs(Nodes, []), Key = ?ets_first(Store), N = 0, Prep = case Type of async -> #prep{protocol = sym_trans, records = Recs};
  30. 30. sync -> #prep{protocol = sync_sym_trans, records = Recs} end, case catch do_arrange(Tid, Store, Key, Prep, N) of {EXIT, Reason} -> dbg_out("do_arrange failed ~p ~p~n", [Reason, Tid]), case Reason of {aborted, R} -> mnesia:abort(R); _ -> mnesia:abort(Reason) end; {New, Prepared} -> {New, Prepared#prep{records = reverse(Prepared#prep.records)}} end.Key 参数即为插入临时 ets 表的第一个操作,此处将为 op。do_arrange(Tid, Store, {Tab, Key}, Prep, N) -> Oid = {Tab, Key}, Items = ?ets_lookup(Store, Oid), %% Store is a bag P2 = prepare_items(Tid, Tab, Key, Items, Prep), do_arrange(Tid, Store, ?ets_next(Store, Oid), P2, N + 1);do_arrange(Tid, Store, SchemaKey, Prep, N) when SchemaKey == op -> Items = ?ets_lookup(Store, SchemaKey), %% Store is a bag P2 = prepare_schema_items(Tid, Items, Prep), do_arrange(Tid, Store, ?ets_next(Store, SchemaKey), P2, N + 1);可以看出,普通表的 key 为{Tab, Key},而 schema 表的 key 为 op,取得的 Itens 为[{op,change_taboe_majority, 表的新 cstruct 组成的 proplist, OldMajority, Majority}],这导致本次事务使用不同的提交协议:prepare_schema_items(Tid, Items, Prep) -> Types = [{N, schema_ops} || N <- val({current, db_nodes})], Recs = prepare_nodes(Tid, Types, Items, Prep#prep.records, schema), Prep#prep{protocol = asym_trans, records = Recs}.prepare_node 在 Recs 的 schema_ops 成员中记录 schema 表的操作,同时将表的提交协议设置为 asym_trans。prepare_node(_Node, _Storage, Items, Rec, Kind) when Kind == schema, Rec#commit.schema_ops == [] -> Rec#commit{schema_ops = Items};t_commit(Type) ->
  31. 31. {_Mod, Tid, Ts} = get(mnesia_activity_state), Store = Ts#tidstore.store, if Ts#tidstore.level == 1 -> intercept_friends(Tid, Ts), case arrange(Tid, Store, Type) of {N, Prep} when N > 0 -> multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store); {0, Prep} -> multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store) end; true -> %% nested commit Level = Ts#tidstore.level, [{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores, req({del_store, Tid, Store, Obsolete, false}), NewTs = Ts#tidstore{store = Store, up_stores = Tail, level = Level - 1}, NewTidTs = {OldMod, Tid, NewTs}, put(mnesia_activity_state, NewTidTs), do_commit_nested end.提交过程使用 asym_trans,这个协议主要用于:schema 操作, majority 属性的表的操作, 有recover_coordinator 过程,restore_op 操作。4. schema 事务协议过程multi_commit(asym_trans, Majority, Tid, CR, Store) -> D = #decision{tid = Tid, outcome = presume_abort}, {D2, CR2} = commit_decision(D, CR, [], []), DiscNs = D2#decision.disc_nodes, RamNs = D2#decision.ram_nodes, case have_majority(Majority, DiscNs ++ RamNs) of ok -> ok; {error, Tab} -> mnesia:abort({no_majority, Tab}) end, Pending = mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs), ?ets_insert(Store, Pending), {WaitFor, Local} = ask_commit(asym_trans, Tid, CR2, DiscNs, RamNs),
  32. 32. SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})), {Votes, Pids} = rec_all(WaitFor, Tid, do_commit, []), ?eval_debug_fun({?MODULE, multi_commit_asym_got_votes}, [{tid, Tid}, {votes, Votes}]), case Votes of do_commit -> case SchemaPrep of {_Modified, C = #commit{}, DumperMode} -> mnesia_log:log(C), % C is not a binary ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_rec}, [{tid, Tid}]), D3 = C#commit.decision, D4 = D3#decision{outcome = unclear}, mnesia_recover:log_decision(D4), ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_dec}, [{tid, Tid}]), tell_participants(Pids, {Tid, pre_commit}), rec_acc_pre_commit(Pids, Tid, Store, {C,Local}, do_commit, DumperMode, [], []); {EXIT, Reason} -> mnesia_recover:note_decision(Tid, aborted), ?eval_debug_fun({?MODULE, multi_commit_asym_prepare_exit}, [{tid, Tid}]), tell_participants(Pids, {Tid, {do_abort, Reason}}), do_abort(Tid, Local), {do_abort, Reason} end; {do_abort, Reason} -> mnesia_recover:note_decision(Tid, aborted), ?eval_debug_fun({?MODULE, multi_commit_asym_do_abort}, [{tid, Tid}]), tell_participants(Pids, {Tid, {do_abort, Reason}}), do_abort(Tid, Local), {do_abort, Reason} end.事务处理过程从 mnesia_tm:t_commit/1 开始,流程如下:1. 发起节点检查 majority 条件是否满足,即表的存活副本节点数必须大于表的磁盘和内存 副本节点数的一半,等于一半时亦不满足2. 发起节点操作发起节点调用 mnesia_checkpoint:tm_enter_pending,产生检查点3. 发起节点向各个参与节点的事务管理器发起第一阶段提交过程 ask_commit,注意此时协 议类型为 asym_trans4. 参与节点事务管理器创建一个 commit_participant 进程,该进程将进行负责接下来的提
  33. 33. 交过程 注意 majority 表和 schema 表操作,需要额外创建一个进程辅助提交,可能导致性能变 低5. 参 与 节 点 commit_participant 进 程 进 行 本 地 schema 操 作 的 prepare 过 程 , 对 于 change_table_majority,没有什么需要 prepare 的6. 参与节点 commit_participant 进程同意提交,向发起节点返回 vote_yes7. 发起节点收到所有参与节点的同意提交消息8. 发起节点进行本地 schema 操作的 prepare 过程,对于 change_table_majority,同样没有 什么需要 prepare 的9. 发起节点收到所有参与节点的 vote_yes 后,记录需要提交的操作的日志10. 发起节点记录第一阶段恢复日志 presume_abort;11. 发起节点记录第二阶段恢复日志 unclear12. 发起节点向各个参与节点的 commit_participant 进程发起第二阶段提交过程 pre_commit13. 参与节点 commit_participant 进程收到 pre_commit 进行预提交14. 参与节点记录第一阶段恢复日志 presume_abort15. 参与节点记录第二阶段恢复日志 unclear16. 参与节点 commit_participant 进程同意预提交,向发起节点返回 acc_pre_commit17. 发起节点收到所有参与节点的 acc_pre_commit 后,记录需要等待的 schema 操作参与节 点,用于崩溃恢复过程18. 发起节点向各个参与节点的 commit_participant 进程发起第三阶段提交过程 committed19. a.发起节点通知完参与节点进行 committed 后,立即记录第二阶段恢复日志 committed b.参与节点 commit_participant 进程收到 committed 后进行提交,立即记录第二阶段恢复
  34. 34. 日志 committed20. a.发起节点记录完第二阶段恢复日志后,进行本地提交,通过 do_commit 完成 b.参与节点 commit_participant 进程记录完第二阶段恢复日志后,进行本地提交,通过 do_commit 完成21. a.发起节点本地提交完成后,若有 schema 操作,则同步等待参与节点 commit_participant 进程的 schema 操作的提交结果 b.参与节点 commit_participant 进程本地提交完成后,若有 schema 操作,则向发起节点 返回 schema_commit22. a.发起节点本地收到所有参与节点的 schema_commit 后,释放锁和事务资源 b.参与节点 commit_participant 进程释放锁和事务资源5. 远程节点事务管理器第一阶段提交 prepare 响应参与节点事务管理器收到第一阶段提交的消息后:mnesia.erldoit_loop(#state{coordinators=Coordinators,participants=Participants,supervisor=Sup}=State) ->… {From, {ask_commit, Protocol, Tid, Commit, DiscNs, RamNs}} -> ?eval_debug_fun({?MODULE, doit_ask_commit}, [{tid, Tid}, {prot, Protocol}]), mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs), Pid = case Protocol of asym_trans when node(Tid#tid.pid) /= node() -> Args = [tmpid(From), Tid, Commit, DiscNs, RamNs], spawn_link(?MODULE, commit_participant, Args); _ when node(Tid#tid.pid) /= node() -> %% *_sym_trans reply(From, {vote_yes, Tid}), nopid end, P = #participant{tid = Tid,
  35. 35. pid = Pid, commit = Commit, disc_nodes = DiscNs, ram_nodes = RamNs, protocol = Protocol}, State2 = State#state{participants = gb_trees:insert(Tid,P,Participants)}, doit_loop(State2);…创建一个 commit_participant 进程,参数包括[发起节点的进程号,事务 id,提交内容,磁盘节点列表,内存节点列表],辅助事务提交:commit_participant(Coord, Tid, Bin, DiscNs, RamNs) when is_binary(Bin) -> process_flag(trap_exit, true), Commit = binary_to_term(Bin), commit_participant(Coord, Tid, Bin, Commit, DiscNs, RamNs);commit_participant(Coord, Tid, C = #commit{}, DiscNs, RamNs) -> process_flag(trap_exit, true), commit_participant(Coord, Tid, C, C, DiscNs, RamNs).commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) -> ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]), case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of {Modified, C = #commit{}, DumperMode} ->、 case lists:member(node(), DiscNs) of false -> ignore; true -> case Modified of false -> mnesia_log:log(Bin); true -> mnesia_log:log(C) end end, ?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]), reply(Coord, {vote_yes, Tid, self()}), …参与节点的 commit_participant 进程在创建初期,需要在本地进行 schema 表的 prepare 工作:mnesia_schema.erlprepare_commit(Tid, Commit, WaitFor) -> case Commit#commit.schema_ops of [] -> {false, Commit, optional};
  36. 36. OrigOps -> {Modified, Ops, DumperMode} = prepare_ops(Tid, OrigOps, WaitFor, false, [], optional), InitBy = schema_prepare, GoodRes = {Modified, Commit#commit{schema_ops = lists:reverse(Ops)}, DumperMode}, case DumperMode of optional -> dbg_out("Transaction log dump skipped (~p): ~w~n", [DumperMode, InitBy]); mandatory -> case mnesia_controller:sync_dump_log(InitBy) of dumped -> GoodRes; {error, Reason} -> mnesia:abort(Reason) end end, case Ops of [] -> ignore; _ -> mnesia_controller:wait_for_schema_commit_lock() end, GoodRes end.注意此处,包含三个主要分支:1. 若操作中不包含任何 schema 操作,则不进行任何动作,仅返回{false, 原 Commit 内容, optional},这适用于 majority 类表的操作2. 若操作通过 prepare_ops 判定后,如果包含这些操作:rec,announce_im_running, sync_trans , create_table , delete_table , add_table_copy , del_table_copy , change_table_copy_type,dump_table,add_snmp,transform,merge_schema,有可能 但不一定需要进行 prepare,prepare 动作包括各类操作自身的一些内容记录,以及 sync 日志,这适用于出现上述操作的时候3. 若操作通过 prepare_ops 判定后,仅包含其它类型的操作,则不作任何动作,仅返回{true, 原 Commit 内容, optional},这适用于较小的 schema 操作,此处的 change_table_majority 就属于这类操作
  37. 37. 6. 远程节点事务参与者第二阶段提交 precommit 响应commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->… receive {Tid, pre_commit} -> D = C#commit.decision, mnesia_recover:log_decision(D#decision{outcome = unclear}), ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]), Expect_schema_ack = C#commit.schema_ops /= [], reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}), receive {Tid, committed} -> mnesia_recover:log_decision(D#decision{outcome = committed}), ?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]), do_commit(Tid, C, DumperMode), case Expect_schema_ack of false -> ignore; true -> reply(Coord, {schema_commit, Tid, self()}) end, ?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]); … end;…参与节点的 commit_participant 进程收到预提交消息后,同样记录第二阶段恢复日志 unclear,并返回 acc_pre_commit7. 请求节点事务发起者收到第二阶段提交 precommit 确认发起节点收到所有参与节点的 acc_pre_commit 消息后:rec_acc_pre_commit([], Tid, Store, {Commit,OrigC}, Res, DumperMode, GoodPids,SchemaAckPids) -> D = Commit#commit.decision, case Res of do_commit ->
  38. 38. prepare_sync_schema_commit(Store, SchemaAckPids), tell_participants(GoodPids, {Tid, committed}), D2 = D#decision{outcome = committed}, mnesia_recover:log_decision(D2), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_commit}, [{tid, Tid}]), do_commit(Tid, Commit, DumperMode), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_commit}, [{tid, Tid}]), sync_schema_commit(Tid, Store, SchemaAckPids), mnesia_locker:release_tid(Tid), ?MODULE ! {delete_transaction, Tid}; {do_abort, Reason} -> tell_participants(GoodPids, {Tid, {do_abort, Reason}}), D2 = D#decision{outcome = aborted}, mnesia_recover:log_decision(D2), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_abort}, [{tid, Tid}]), do_abort(Tid, OrigC), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_abort}, [{tid, Tid}]) end, Res.prepare_sync_schema_commit(_Store, []) -> ok;prepare_sync_schema_commit(Store, [Pid | Pids]) -> ?ets_insert(Store, {waiting_for_commit_ack, node(Pid)}), prepare_sync_schema_commit(Store, Pids).发起节点在本地记录参与 schema 操作的节点,用于崩溃恢复过程,然后向所有参与节点commit_participant 进程发送 committed,通知其进行最终提交,此时发起节点可以进行本地提交,记录第二阶段恢复日志 committed,本地提交通过 do_commit 完成,然后同步等待参与节点的 schema 操作提交结果,若没有 schema 操作,则可以立即返回,此处需要等待:sync_schema_commit(_Tid, _Store, []) -> ok;sync_schema_commit(Tid, Store, [Pid | Tail]) -> receive {?MODULE, _, {schema_commit, Tid, Pid}} -> ?ets_match_delete(Store, {waiting_for_commit_ack, node(Pid)}), sync_schema_commit(Tid, Store, Tail); {mnesia_down, Node} when Node == node(Pid) -> ?ets_match_delete(Store, {waiting_for_commit_ack, Node}), sync_schema_commit(Tid, Store, Tail) end.
  39. 39. 8. 远程节点事务参与者第三阶段提交 commit 响应参与节点 commit_participant 进程收到提交消息后:commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->… receive {Tid, pre_commit} -> D = C#commit.decision, mnesia_recover:log_decision(D#decision{outcome = unclear}), ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]), Expect_schema_ack = C#commit.schema_ops /= [], reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}), receive {Tid, committed} -> mnesia_recover:log_decision(D#decision{outcome = committed}), ?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]), do_commit(Tid, C, DumperMode), case Expect_schema_ack of false -> ignore; true -> reply(Coord, {schema_commit, Tid, self()}) end, ?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]); … end;…参与节点的 commit_participant 进程收到预提交消息后,同样记录第而阶段恢复日志committed,通过 do_commit 进行本地提交后,若有 schema 操作,则向发起节点返回schema_commit,否则完成事务。9. 第三阶段提交 commit 的本地提交过程do_commit(Tid, C, DumperMode) -> mnesia_dumper:update(Tid, C#commit.schema_ops, DumperMode), R = do_snmp(Tid, C#commit.snmp),
  40. 40. R2 = do_update(Tid, ram_copies, C#commit.ram_copies, R), R3 = do_update(Tid, disc_copies, C#commit.disc_copies, R2), R4 = do_update(Tid, disc_only_copies, C#commit.disc_only_copies, R3), mnesia_subscr:report_activity(Tid), R4.这里仅关注对于 schema 表的更新,同时需要注意,这些更新操作会同时发生在发起节点与参与节点中。对于 schema 表的更新包括:1. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 majority 属性为设置的值,同时更 新表的 where_to_wlock 属性2. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 cstruct,并记录由 cstruct 导出的各 个属性3. 在 schema 的 ets 表中,记录表的 cstruct4. 在 schema 的 dets 表中,记录表的 cstruct更新过程如下:mnesia_dumper.erlupdate(_Tid, [], _DumperMode) -> dumped;update(Tid, SchemaOps, DumperMode) -> UseDir = mnesia_monitor:use_dir(), Res = perform_update(Tid, SchemaOps, DumperMode, UseDir), mnesia_controller:release_schema_commit_lock(), Res.perform_update(_Tid, _SchemaOps, mandatory, true) -> InitBy = schema_update, ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]), opt_dump_log(InitBy);perform_update(Tid, SchemaOps, _DumperMode, _UseDir) -> InitBy = fast_schema_update, InPlace = mnesia_monitor:get_env(dump_log_update_in_place), ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]), case catch insert_ops(Tid, schema_ops, SchemaOps, InPlace, InitBy,
  41. 41. mnesia_log:version()) of {EXIT, Reason} -> Error = {error, {"Schema update error", Reason}}, close_files(InPlace, Error, InitBy), fatal("Schema update error ~p ~p", [Reason, SchemaOps]); _ -> ?eval_debug_fun({?MODULE, post_dump}, [InitBy]), close_files(InPlace, ok, InitBy), ok end.insert_ops(_Tid, _Storage, [], _InPlace, _InitBy, _) -> ok;insert_ops(Tid, Storage, [Op], InPlace, InitBy, Ver) when Ver >= "4.3"-> insert_op(Tid, Storage, Op, InPlace, InitBy), ok;insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver >= "4.3"-> insert_op(Tid, Storage, Op, InPlace, InitBy), insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver);insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver < "4.3" -> insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver), insert_op(Tid, Storage, Op, InPlace, InitBy).…insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) -> Cs = mnesia_schema:list2cs(TabDef), case InitBy of startup -> ignore; _ -> mnesia_controller:change_table_majority(Cs) end, insert_cstruct(Tid, Cs, true, InPlace, InitBy);…对于 change_table_majority 操作,其本身的格式为:{op, change_taboe_majority, 表的新 cstruct 组成的 proplist, OldMajority, Majority}此处将 proplist 形态的 cstruct 转换为 record 形态的 cstruct,然进行真正的设置mnesia_controller.erlchange_table_majority(Cs) -> W = fun() -> Tab = Cs#cstruct.name, set({Tab, majority}, Cs#cstruct.majority), update_where_to_wlock(Tab)
  42. 42. end, update(W).update_where_to_wlock(Tab) -> WNodes = val({Tab, where_to_write}), Majority = case catch val({Tab, majority}) of true -> true; _ -> false end, set({Tab, where_to_wlock}, {WNodes, Majority}).该处做的更新主要为:在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 majority 属性为设置的值,同时更新表的 where_to_wlock 属性,重设 majority 部分mnesia_dumper.erl…insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) -> Cs = mnesia_schema:list2cs(TabDef), case InitBy of startup -> ignore; _ -> mnesia_controller:change_table_majority(Cs) end, insert_cstruct(Tid, Cs, true, InPlace, InitBy);…insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) -> Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts), {schema, Tab, _} = Val, S = val({schema, storage_type}), disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy), Tab.除了在 mnesia 全局变量 ets 表 mnesia_gvar 中更新表的 where_to_wlock 属性外,还要更新其 cstruct 属性,及由此属性导出的其它属性,另外,还需要更新 schema 的 ets 表中记录的表的 cstructmnesia_schema.erlinsert_cstruct(Tid, Cs, KeepWhereabouts) -> Tab = Cs#cstruct.name, TabDef = cs2list(Cs), Val = {schema, Tab, TabDef}, mnesia_checkpoint:tm_retain(Tid, schema, Tab, write), mnesia_subscr:report_table_event(schema, Tid, Val, write), Active = val({Tab, active_replicas}),
  43. 43. case KeepWhereabouts of true -> ignore; false when Active == [] -> clear_whereabouts(Tab); false -> ignore end, set({Tab, cstruct}, Cs), ?ets_insert(schema, Val), do_set_schema(Tab, Cs), Val.do_set_schema(Tab) -> List = get_create_list(Tab), Cs = list2cs(List), do_set_schema(Tab, Cs).do_set_schema(Tab, Cs) -> Type = Cs#cstruct.type, set({Tab, setorbag}, Type), set({Tab, local_content}, Cs#cstruct.local_content), set({Tab, ram_copies}, Cs#cstruct.ram_copies), set({Tab, disc_copies}, Cs#cstruct.disc_copies), set({Tab, disc_only_copies}, Cs#cstruct.disc_only_copies), set({Tab, load_order}, Cs#cstruct.load_order), set({Tab, access_mode}, Cs#cstruct.access_mode), set({Tab, majority}, Cs#cstruct.majority), set({Tab, all_nodes}, mnesia_lib:cs_to_nodes(Cs)), set({Tab, snmp}, Cs#cstruct.snmp), set({Tab, user_properties}, Cs#cstruct.user_properties), [set({Tab, user_property, element(1, P)}, P) || P <- Cs#cstruct.user_properties], set({Tab, frag_properties}, Cs#cstruct.frag_properties), mnesia_frag:set_frag_hash(Tab, Cs#cstruct.frag_properties), set({Tab, storage_properties}, Cs#cstruct.storage_properties), set({Tab, attributes}, Cs#cstruct.attributes), Arity = length(Cs#cstruct.attributes) + 1, set({Tab, arity}, Arity), RecName = Cs#cstruct.record_name, set({Tab, record_name}, RecName), set({Tab, record_validation}, {RecName, Arity, Type}), set({Tab, wild_pattern}, wild(RecName, Arity)), set({Tab, index}, Cs#cstruct.index), %% create actual index tabs later set({Tab, cookie}, Cs#cstruct.cookie), set({Tab, version}, Cs#cstruct.version), set({Tab, cstruct}, Cs), Storage = mnesia_lib:schema_cs_to_storage_type(node(), Cs), set({Tab, storage_type}, Storage),
  44. 44. mnesia_lib:add({schema, tables}, Tab), Ns = mnesia_lib:cs_to_nodes(Cs), case lists:member(node(), Ns) of true -> mnesia_lib:add({schema, local_tables}, Tab); false when Tab == schema -> mnesia_lib:add({schema, local_tables}, Tab); false -> ignore end.do_set_schema 更新由 cstruct 导出的各项属性,如版本,cookie 等mnesia_dumper.erlinsert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) -> Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts), {schema, Tab, _} = Val, S = val({schema, storage_type}), disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy), Tab.disc_insert(_Tid, Storage, Tab, Key, Val, Op, InPlace, InitBy) -> case open_files(Tab, Storage, InPlace, InitBy) of true -> case Storage of disc_copies when Tab /= schema -> mnesia_log:append({?MODULE,Tab}, {{Tab, Key}, Val, Op}), ok; _ -> dets_insert(Op,Tab,Key,Val) end; false -> ignore end.dets_insert(Op,Tab,Key,Val) -> case Op of write -> dets_updated(Tab,Key), ok = dets:insert(Tab, Val); … end.dets_updated(Tab,Key) -> case get(mnesia_dumper_dets) of undefined -> Empty = gb_trees:empty(),
  45. 45. Tree = gb_trees:insert(Tab, gb_sets:singleton(Key), Empty), put(mnesia_dumper_dets, Tree); Tree -> case gb_trees:lookup(Tab,Tree) of {value, cleared} -> ignore; {value, Set} -> T = gb_trees:update(Tab, gb_sets:add(Key, Set), Tree), put(mnesia_dumper_dets, T); none -> T = gb_trees:insert(Tab, gb_sets:singleton(Key), Tree), put(mnesia_dumper_dets, T) end end.更新 schema 的 dets 表中记录的表 cstruct。综上所述,对于 schema 表的变更,或者 majority 类的表,其事务提交过程为三阶段,同时有良好的崩溃恢复检测schema 表的变更包括对多处地方的更新,包括:1. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 xxx 属性为设置的值2. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 cstruct,并记录由 cstruct 导出的各 个属性3. 在 schema 的 ets 表中,记录表的 cstruct4. 在 schema 的 dets 表中,记录表的 cstruct5. majority 事务处理majority 事务总体与 schema 事务处理过程相同,只是在 mnesia_tm:multi_commit 的提交过程中,不调用 mnesia_schema:prepare_commit/3、mnesia_tm:prepare_sync_schema_commit/2修改 schema 表,也不调用 mnesia_tm:sync_schema_commit 等待第三阶段同步提交完成。
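The majority condition checked at the start of multi_commit (step 1 of the commit protocol described earlier) requires the replica nodes taking part in the transaction to be strictly more than half of all replica nodes; exactly half is not enough. An illustrative check, not the actual mnesia_tm code:

%% Sketch: true only if strictly more than half of the table's replica
%% nodes participate in the transaction.
has_majority(AllCopyHolders, Participants) ->
    Alive = [N || N <- AllCopyHolders, lists:member(N, Participants)],
    length(Alive) * 2 > length(AllCopyHolders).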
  46. 46. 6. 恢复mnesia 的连接协商过程用于在启动时,结点间交互状态信息:整个协商包括如下过程:1. 节点发现,集群遍历2. 节点协议版本检查3. 节点 schema 合并4. 节点 decision 通告与合并5. 节点数据重新载入与合并1. 节点协议版本检查+节点 decision 通告与合并mnesia_recover.erlconnect_nodes(Ns) -> %%Ns 为要检查的节点 call({connect_nodes, Ns}).handle_call({connect_nodes, Ns}, From, State) -> %% Determine which nodes we should try to connect AlreadyConnected = val(recover_nodes), {_, Nodes} = mnesia_lib:search_delete(node(), Ns), Check = Nodes -- AlreadyConnected, %%开始版本协商 case mnesia_monitor:negotiate_protocol(Check) of busy -> %% monitor is disconnecting some nodes retry %% the req (to avoid deadlock). erlang:send_after(2, self(), {connect_nodes,Ns,From}), {noreply, State}; [] -> %% No good noodes to connect to! %% We cant use reply here because this function can be
  47. 47. %% called from handle_info gen_server:reply(From, {[], AlreadyConnected}), {noreply, State}; GoodNodes -> %% GoodNodes 是协商通过的节点 %% Now we have agreed upon a protocol with some new nodes %% and we may use them when we recover transactions mnesia_lib:add_list(recover_nodes, GoodNodes), %%协议版本协商通过后,告知这些节点本节点曾经的历史事务 decision cast({announce_all, GoodNodes}), case get_master_nodes(schema) of [] -> Context = starting_partitioned_network, %%检查曾经是否与这些节点出现过分区 mnesia_monitor:detect_inconcistency(GoodNodes, Context); _ -> %% If master_nodes is set ignore old inconsistencies ignore end, gen_server:reply(From, {GoodNodes, AlreadyConnected}), {noreply,State} end;handle_cast({announce_all, Nodes}, State) -> announce_all(Nodes), {noreply, State};announce_all([]) -> ok;announce_all(ToNodes) -> Tid = trans_tid_serial(), announce(ToNodes, [{trans_tid,serial,Tid}], [], false).announce(ToNodes, [Head | Tail], Acc, ForceSend) -> Acc2 = arrange(ToNodes, Head, Acc, ForceSend), announce(ToNodes, Tail, Acc2, ForceSend);announce(_ToNodes, [], Acc, _ForceSend) -> send_decisions(Acc).send_decisions([{Node, Decisions} | Tail]) -> %%注意此处,decision 合并过程是一个异步过程 abcast([Node], {decisions, node(), Decisions}), send_decisions(Tail);send_decisions([]) ->
  48. 48. ok.遍历所有协商通过的节点,告知其本节点的历史事务 decision下列流程位于远程节点中,远程节点将被称为接收节点,而本节点将称为发送节点handle_cast({decisions, Node, Decisions}, State) -> mnesia_lib:add(recover_nodes, Node), State2 = add_remote_decisions(Node, Decisions, State), {noreply, State2};接收节点的 mnesia_monitor 在收到这些广播来的 decision 后,进行比较合并。decision 有多种类型,用于事务提交的为 decision 结构和 transient_decision 结构add_remote_decisions(Node, [D | Tail], State) when is_record(D, decision) -> State2 = add_remote_decision(Node, D, State), add_remote_decisions(Node, Tail, State2);add_remote_decisions(Node, [C | Tail], State) when is_record(C, transient_decision) -> D = #decision{tid = C#transient_decision.tid, outcome = C#transient_decision.outcome, disc_nodes = [], ram_nodes = []}, State2 = add_remote_decision(Node, D, State), add_remote_decisions(Node, Tail, State2);add_remote_decisions(Node, [{mnesia_down, _, _, _} | Tail], State) -> add_remote_decisions(Node, Tail, State);add_remote_decisions(Node, [{trans_tid, serial, Serial} | Tail], State) -> %%对于发送节点传来的未决事务,接收节点需要继续询问其它节点 sync_trans_tid_serial(Serial), case State#state.unclear_decision of undefined -> ignored; D -> case lists:member(Node, D#decision.ram_nodes) of true -> ignore; false -> %%若未决事务 decision 的发送节点不是内存副本节点,则接收节点将向其询问该未决事务的真正结果 abcast([Node], {what_decision, node(), D}) end
  49. 49. end, add_remote_decisions(Node, Tail, State);add_remote_decisions(_Node, [], State) -> State.add_remote_decision(Node, NewD, State) -> Tid = NewD#decision.tid, OldD = decision(Tid), %%根据合并策略进行 decision 合并,对于唯一的冲突情况,即接收节点提交事务,而发送节点中止事务,则接收节点处也选择中止事务,而事务本身的状态将由检查点和 redo日志进行重构 D = merge_decisions(Node, OldD, NewD), %%记录合并结果 do_log_decision(D, false, undefined), Outcome = D#decision.outcome, if OldD == no_decision -> ignore; Outcome == unclear -> ignore; true -> case lists:member(node(), NewD#decision.disc_nodes) or lists:member(node(), NewD#decision.ram_nodes) of true -> %%向其它节点告知本节点的 decision 合并结果 tell_im_certain([Node], D); false -> ignore end end, case State#state.unclear_decision of U when U#decision.tid == Tid -> WaitFor = State#state.unclear_waitfor -- [Node], if Outcome == unclear, WaitFor == [] -> %% Everybody are uncertain, lets abort %%询问过未决事务的所有参与节点后,仍然没有任何节点可以提供事务提交结果,此时决定终止事务 NewOutcome = aborted, CertainD = D#decision{outcome = NewOutcome,
  50. 50. disc_nodes = [], ram_nodes = []}, tell_im_certain(D#decision.disc_nodes, CertainD), tell_im_certain(D#decision.ram_nodes, CertainD), do_log_decision(CertainD, false, undefined), verbose("Decided to abort transaction ~p " "since everybody are uncertain ~p~n", [Tid, CertainD]), gen_server:reply(State#state.unclear_pid, {ok, NewOutcome}), State#state{unclear_pid = undefined, unclear_decision = undefined, unclear_waitfor = undefined}; Outcome /= unclear -> %%发送节点知道事务结果,通告事务结果 verbose("~p told us that transaction ~p was ~p~n", [Node, Tid, Outcome]), gen_server:reply(State#state.unclear_pid, {ok, Outcome}), State#state{unclear_pid = undefined, unclear_decision = undefined, unclear_waitfor = undefined}; Outcome == unclear -> %%发送节点也不知道事务结果,此时继续等待 State#state{unclear_waitfor = WaitFor} end; _ -> State end.合并策略:merge_decisions(Node, D, NewD0) -> NewD = filter_aborted(NewD0), if D == no_decision, node() /= Node -> %% We did not know anything about this txn NewD#decision{disc_nodes = []}; D == no_decision -> NewD; is_record(D, decision) -> DiscNs = D#decision.disc_nodes -- ([node(), Node]), OldD = filter_aborted(D#decision{disc_nodes = DiscNs}), if
  51. 51. OldD#decision.outcome == unclear, NewD#decision.outcome == unclear -> D; OldD#decision.outcome == NewD#decision.outcome -> %% We have come to the same decision OldD; OldD#decision.outcome == committed, NewD#decision.outcome == aborted -> %%decision 发送节点与接收节点唯一冲突的位置,即接收节点提交事务,而发送节点中止事务,此时仍然选择中止事务 Msg = {inconsistent_database, bad_decision, Node}, mnesia_lib:report_system_event(Msg), OldD#decision{outcome = aborted}; OldD#decision.outcome == aborted -> OldD#decision{outcome = aborted}; NewD#decision.outcome == aborted -> OldD#decision{outcome = aborted}; OldD#decision.outcome == committed, NewD#decision.outcome == unclear -> OldD#decision{outcome = committed}; OldD#decision.outcome == unclear, NewD#decision.outcome == committed -> OldD#decision{outcome = committed} end end.2. 节点发现,集群遍历mnesia_controller.erlmerge_schema() -> AllNodes = mnesia_lib:all_nodes(), %%尝试合并 schema,合并完了后通知所有曾经的集群节点,与本节点进行数据转移 case try_merge_schema(AllNodes, [node()], fun default_merge/1) of ok -> %%合并 schema 成功后,将进行数据合并 schema_is_merged(); {aborted, {throw, Str}} when is_list(Str) -> fatal("Failed to merge schema: ~s~n", [Str]); Else -> fatal("Failed to merge schema: ~p~n", [Else]) end.
  52. 52. try_merge_schema(Nodes, Told0, UserFun) -> %%开始集群遍历,启动一个 schema 合并事务 case mnesia_schema:merge_schema(UserFun) of {atomic, not_merged} -> %% No more nodes that we need to merge the schema with %% Ensure we have told everybody that we are running case val({current,db_nodes}) -- mnesia_lib:uniq(Told0) of [] -> ok; Tell -> im_running(Tell, [node()]), ok end; {atomic, {merged, OldFriends, NewFriends}} -> %% Check if new nodes has been added to the schema Diff = mnesia_lib:all_nodes() -- [node() | Nodes], mnesia_recover:connect_nodes(Diff), %% Tell everybody to adopt orphan tables %%通知所有的集群节点,本节点启动,开始数据合并申请 im_running(OldFriends, NewFriends), im_running(NewFriends, OldFriends), Told = case lists:member(node(), NewFriends) of true -> Told0 ++ OldFriends; false -> Told0 ++ NewFriends end, try_merge_schema(Nodes, Told, UserFun); {atomic, {"Cannot get cstructs", Node, Reason}} -> dbg_out("Cannot get cstructs, Node ~p ~p~n", [Node, Reason]), timer:sleep(300), % Avoid a endless loop look alike try_merge_schema(Nodes, Told0, UserFun); {aborted, {shutdown, _}} -> %% One of the nodes is going down timer:sleep(300), % Avoid a endless loop look alike try_merge_schema(Nodes, Told0, UserFun); Other -> Other end.mnesia_schema.erlmerge_schema() -> schema_transaction(fun() -> do_merge_schema([]) end).merge_schema(UserFun) -> schema_transaction(fun() -> UserFun(fun(Arg) -> do_merge_schema(Arg) end) end).可以看出 merge_schema 的过程也是放在一个 mnesia 元数据事务中进行的,这个事务的主
  53. 53. 题操作包括:{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}{op, merge_schema, CstructList}这个过程会与集群中的事务节点进行 schema 协商,检查 schema 是否兼容。do_merge_schema(LockTabs0) -> %% 锁 schema 表 {_Mod, Tid, Ts} = get_tid_ts_and_lock(schema, write), LockTabs = [{T, tab_to_nodes(T)} || T <- LockTabs0], [get_tid_ts_and_lock(T,write) || {T,_} <- LockTabs], Connected = val(recover_nodes), Running = val({current, db_nodes}), Store = Ts#tidstore.store, %% Verify that all nodes are locked that might not be the %% case, if this trans where queued when new nodes where added. case Running -- ets:lookup_element(Store, nodes, 2) of [] -> ok; %% All known nodes are locked Miss -> %% Abort! We dont want the sideeffects below to be executed mnesia:abort({bad_commit, {missing_lock, Miss}}) end, %% Connected 是本节点的已连接节点,通常为当前集群中通信协议兼容的结点; Running是本节点的当前 db_nodes,通常为当前集群中与本节点一致的结点; case Connected -- Running of %% 对于那些已连接,但是还未进行 decision 的节点,需要进行通信协议协商,然后进行 decision 协商,这个过程实质上是一个全局拓扑下的节点发现过程(遍历算法) ,这个过程由某个节点发起, [Node | _] = OtherNodes -> %% Time for a schema merging party! mnesia_locker:wlock_no_exist(Tid, Store, schema, [Node]), [mnesia_locker:wlock_no_exist( Tid, Store, T, mnesia_lib:intersect(Ns, OtherNodes)) || {T,Ns} <- LockTabs], %% 从远程结点 Node 处取得其拥有的表的 cstruct,及其 db_nodes RemoteRunning1 case fetch_cstructs(Node) of {cstructs, Cstructs, RemoteRunning1} ->
  54. 54. LockedAlready = Running ++ [Node], %% 取得 cstruct 后,通过 mnesia_recover:connect_nodes,与远程节点 Node的集群中的每一个节点进行协商,协商主要包括检查双方的通信协议版本,并检查之前与这些结点是否曾有过分区 {New, Old} = mnesia_recover:connect_nodes(RemoteRunning1), %% New 为 RemoteRunning1 中版本兼容的新结点, 为本节点原先的集群存 Old活结点,来自于 recover_nodes RemoteRunning = mnesia_lib:intersect(New ++ Old, RemoteRunning1), If %% RemoteRunning = (New∪Old)∩RemoteRunning1 %% RemoteRunning≠RemoteRunning <=> %% New∪(Old∩RemoteRunning1) < RemoteRunning1 %%意味着 RemoteRunning1(远程节点 Node 的集群,也即此次探查的目标集群)中有部分节点不能与本节点相连 RemoteRunning /= RemoteRunning1 -> mnesia_lib:error("Mnesia on ~p could not connect to node(s) ~p~n", [node(), RemoteRunning1 -- RemoteRunning]), mnesia:abort({node_not_running, RemoteRunning1 -- RemoteRunning}); true -> ok end, NeedsLock = RemoteRunning -- LockedAlready, mnesia_locker:wlock_no_exist(Tid, Store, schema, NeedsLock), [mnesia_locker:wlock_no_exist(Tid, Store, T,mnesia_lib:intersect(Ns,NeedsLock)) || {T,Ns} <- LockTabs], NeedsConversion = need_old_cstructs(NeedsLock ++ LockedAlready), {value, SchemaCs} = lists:keysearch(schema, #cstruct.name, Cstructs), SchemaDef = cs2list(NeedsConversion, SchemaCs), %% Announce that Node is running %%开始 announce_im_running 的过程,向集群的事务事务通告本节点进入集群,同时告知本节点,集群事务节点在这个事务中会与本节点进行 schema 合并 A = [{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}],
  55. 55. do_insert_schema_ops(Store, A), %% Introduce remote tables to local node %%make_merge_schema 构造一系列合并 schema 的 merge_schema 操作,在提交成功后由 mnesia_dumper 执行生效 do_insert_schema_ops(Store, make_merge_schema(Node, NeedsConversion,Cstructs)), %% Introduce local tables to remote nodes Tabs = val({schema, tables}), Ops = [{op, merge_schema, get_create_list(T)} || T <- Tabs, not lists:keymember(T, #cstruct.name, Cstructs)], do_insert_schema_ops(Store, Ops), %%Ensure that the txn will be committed on all nodes %%向另一个可连接集群中的所有节点通告本节点正在加入集群 NewNodes = RemoteRunning -- Running, mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}), announce_im_running(NewNodes, SchemaCs), {merged, Running, RemoteRunning}; {error, Reason} -> {"Cannot get cstructs", Node, Reason}; {badrpc, Reason} -> {"Cannot get cstructs", Node, {badrpc, Reason}} end; [] -> %% No more nodes to merge schema with not_merged end.announce_im_running([N | Ns], SchemaCs) -> %%与新的可连接集群的节点经过协商 {L1, L2} = mnesia_recover:connect_nodes([N]), case lists:member(N, L1) or lists:member(N, L2) of true -> %%若协商通过,则这些节点就可以作为本节点的事务节点了,注意此处,这个修改是立即生效的,而不会延迟到事务提交 mnesia_lib:add({current, db_nodes}, N), mnesia_controller:add_active_replica(schema, N, SchemaCs);
