An Overview of mnesia Split-Brain Issues (mnesia 脑裂问题综述)

Contents

1. Symptoms and Causes
2. How mnesia Works
3. Common Questions and Caveats
4. Source Code Analysis
   1. How mnesia:create_schema/1 works
      1. Overall flow
      2. First half: the work done by mnesia:create_schema/1
      3. Second half: the work done by mnesia:start/0
   4. How mnesia:change_table_majority/2 works
      1. The call interface
      2. The transaction operations
      3. The schema transaction commit interface
      4. The schema transaction protocol
      5. Remote node: transaction manager's phase-1 prepare response
      6. Remote node: transaction participant's phase-2 precommit response
      7. Requesting node: transaction coordinator receives the phase-2 precommit acknowledgements
      8. Remote node: transaction participant's phase-3 commit response
      9. The local commit during phase-3 commit
   5. majority transaction handling
   6. Recovery
      1. Node protocol version check + decision announcement and merging
      2. Node discovery and cluster traversal
      3. Node schema merging
      4. Node data merging, part 1: loading tables from a remote node
      5. Node data merging, part 2: loading tables from local disk
      6. Node data merging, part 2: table loading completed
   7. Partition detection
      1. Synchronous detection during locking
      2. Synchronous detection during transactions
      3. Asynchronous detection of nodes going down
      4. Asynchronous detection of nodes coming up
   8. Miscellaneous

The code analysed below is from Erlang/OTP R15B03.

1. Symptoms and Causes

Symptom: after a network partition, mnesia keeps accepting different writes in each partition, so the partitions drift into mutually inconsistent states, and they remain inconsistent after the partition heals. If any one partition is then restarted, the restarted side pulls its data from the surviving side and loses its own previous data.

Cause: distributed systems are constrained by the CAP theorem (a system that tolerates network partitions cannot provide both availability and consistency at the same time). To preserve availability, some distributed stores give up strong consistency in favour of eventual consistency. mnesia is such an eventually consistent distributed database: with no partition it is strongly consistent, but during a partition it still allows writes and therefore becomes inconsistent. After the partition disappears, the application has to deal with that inconsistency. A simple recovery is to restart the partition that is to be discarded, so that it re-fetches its data from the partition that is kept; a more involved recovery requires writing a data-repair program and applying it.

2. How mnesia Works

State diagram of how mnesia operates; the transaction path shown uses majority transactions, i.e. writes are allowed only while a majority of the nodes are in the cluster. The mechanism is explained below:
1. Transactions: run-time transactions provide strong consistency while there is no partition. mnesia supports several kinds of access:
   a) dirty writes: no lock, no transaction, one asynchronous phase;
   b) locked asynchronous transactions: one synchronous lock phase, then one synchronous plus one asynchronous transaction phase;
   c) locked synchronous transactions: one synchronous lock phase, then two synchronous transaction phases;
   d) locked majority transactions: one synchronous lock phase, then two synchronous plus one asynchronous transaction phase;
   e) locked schema transactions: one synchronous lock phase, then three synchronous transaction phases; in effect a majority transaction that also carries schema operations.

2. Recovery: restart-time recovery provides eventual consistency once a partition has occurred. When mnesia restarts it performs the following distributed negotiation:
   a) node discovery;
   b) node protocol version negotiation;
   c) node schema merging;
   d) merging of transaction decisions between nodes:
      i. if the remote decision is abort and the local decision is commit, there is a conflict: {inconsistent_database, bad_decision, Node} is reported and the local decision is changed to abort;
      ii. if the remote decision is commit and the local decision is abort, the local decision stays abort; the remote node will adjust its decision and report the conflict;
      iii. if the remote decision is unclear and the local decision is not, the local decision becomes the outcome and the remote node adjusts;
      iv. if both the remote and the local decision are unclear, the node waits for some other node that knows the outcome to start, and adopts that node's result;
      v. if every node's decision is unclear, the outcome remains unclear;
      vi. transaction decisions do not by themselves change the actual table contents;
   e) merging of table data between nodes:
      i. if the local node is a master node for the table, it loads the table from disk;
      ii. if the local node holds local_content tables, it loads those from disk;
      iii. if a remote replica node is alive, the table is pulled from that remote node;
      iv. if no remote replica is alive and the local node was the last one to shut down, the table is loaded from local disk;
      v. if no remote replica is alive and the local node was not the last one to shut down, it waits until some remote node has started and loaded the table, then pulls it from there; until that happens the table is not accessible;
      vi. once a table has been loaded, it is not pulled from remote nodes again;
      vii. from the cluster's point of view:
         1. when another node restarts and initiates a new negotiation, this node adds it to its view of the cluster topology;
         2. when a node in the cluster goes down (shut down or partitioned away), this node removes it from its topology view;
         3. when a partition heals, no negotiation takes place, so nodes from the other partition are not added back to the topology view and the partitions remain separate.

3. Inconsistency detection: at run time and at restart, mnesia monitors the up/down history of remote nodes and their transaction decisions to detect whether a network partition has ever occurred. If it has, there is potential inconsistency between the partitions, and an inconsistent_database event is delivered to the application:
   a) at run time, the up/down history of remote nodes is monitored; if both sides have seen the other node down, then when the remote node comes up again {inconsistent_database, running_partitioned_network, Node} is reported;
   b) at restart, the up/down history of remote nodes is monitored; if both sides have seen the other node down, then when the remote node comes up again {inconsistent_database, starting_partitioned_network, Node} is reported;
   c) at run time and at restart, transaction decisions are exchanged with remote nodes; if the remote node aborted a transaction that was committed locally, {inconsistent_database, bad_decision, Node} is reported.

3. Common Questions and Caveats

Wherever the questions below involve transactions, only majority transactions are considered: they are somewhat more complete than synchronous and asynchronous transactions, and they do not involve schema operations.

fail_safe state: the state, after a network partition, in which the minority partition cannot be written to.

Common questions:

1. After a partition this node ends up in the minority partition. Once the partition heals, if this node is not restarted, does it stay in the fail_safe state forever?
   If other nodes keep starting up, negotiating with this node and joining its cluster, the cluster can become a majority again and become writable.
   If no other node ever starts, this node keeps the fail_safe state indefinitely.

2. After a brief network interruption, a write is made in the majority partition and cannot reach the minority partition. After the partition heals, a write is attempted in the minority partition. How does the minority end up in the fail_safe state?
   mnesia relies on the Erlang VM to detect nodes going down. When the majority side writes, its VM detects that the minority nodes are down, and the minority side's VM likewise detects that the majority nodes are down. Since both sides have seen the other side down, the majority remains writable while the minority enters the fail_safe state.

3. For a cluster of A, B and C, a partition separates A from B and C; data is written on B and C, then the partition heals. What happens if A is restarted? What happens if B and C are restarted?
   Experimentally:
   a) if A is restarted, the records written on B and C are correctly visible on A; this follows from A's startup negotiation, in which A requests the table data from B and C;
   b) if B and C are restarted, the records written earlier are no longer visible on B and C; this follows from their startup negotiation, in which B and C request the table data from A.

Caveats:

1. During a partition mnesia is eventually consistent, not strongly consistent. Strong consistency can be obtained by designating a master node to arbitrate the final data, but that also introduces a single point of failure.
2. If you subscribe to the system events mnesia produces (including the inconsistent_database event) after mnesia has already started, some events may already have been emitted and will be missed.
3. Subscriptions to mnesia events are not persistent; they must be re-established whenever mnesia restarts.
4. A majority transaction uses two synchronous phases plus one asynchronous phase; during commit each participating node additionally spawns a process to handle the transaction, on top of the usual ets table and the synchronous lock phase, which can further reduce performance.
5. Majority transactions place no constraint on the recovery process, and recovery prefers a surviving remote node's copy of a table as the basis for restoring the local copy.
6. mnesia's check for and reporting of inconsistent_database is a fairly strong condition and can produce false positives.
7. When split-brain is detected after a partition has ended, it is usually better to raise an alert and have a human resolve it than to resolve it automatically (a minimal event-subscription sketch follows this section).
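Caveats 2, 3 and 7 suggest subscribing to mnesia's system events and merely alerting when inconsistent_database is reported. Below is a minimal sketch of such a subscriber; it uses only the documented mnesia:subscribe(system) API, while the module name, process structure and alerting policy are illustrative assumptions, not part of mnesia.

-module(split_brain_watch).
-export([start/0, init/0]).

%% Minimal sketch: subscribe to mnesia system events and alert a human
%% when a potential split brain is reported (see caveat 7 above).
start() ->
    {ok, spawn_link(?MODULE, init, [])}.

init() ->
    %% Subscriptions are not persistent (caveat 3): this must be redone
    %% every time mnesia is (re)started, and events emitted before the
    %% subscription are missed (caveat 2).
    {ok, _Node} = mnesia:subscribe(system),
    loop().

loop() ->
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            %% Context is running_partitioned_network,
            %% starting_partitioned_network or bad_decision, as above.
            error_logger:error_msg("possible mnesia split brain (~p) with ~p~n",
                                   [Context, Node]),
            loop();
        {mnesia_system_event, _Other} ->
            loop()
    end.

If strong consistency is preferred over availability (caveat 1), mnesia:set_master_nodes/1,2 can designate the nodes whose replicas win at load time, at the cost of a single point of arbitration.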
  • 8. 4. 源码分析主题包括:1. mnesia 磁盘表副本要求 schema 也有磁盘表副本,因此需要参考 mnesia:create_schema/1 的工作过程;2. 此处使用 majority 事务进行解释, 必须参考 mnesia:change_table_majority/2 的工作过程, 且此过程是 schema 事务,可以更详细全面的理解 majority 事务;3. majority 事务处理将弱化 schema 事务模型,进行特定的解释;4. 恢复过程分析 mnesia 启动时的主要工作、分布式协商过程、磁盘表加载;5. 分区检查分析 mnesia 如何检查各类 inconsistent_database 事件;1. mnesia:create_schema/1 的工作过程1. 主体过程安装 schema 的过程必须要在 mnesia 停机的条件下进行,此后,mnesia 启动。schema 添加的过程本质上是一个两阶段提交过程:schema 变更发起节点1. 询问各个参与节点是否已经由 schema 副本2. 上全局锁{mnesia_table_lock, schema}3. 在各个参与节点上建立 mnesia_fallback 进程4. 第一阶段向各个节点的 mnesia_fallback 进程广播{start, Header, Schema2}消息,通知其保 存新生成的 schema 文件备份
  • 9. 5. 第二阶段向各个节点的 mnesia_fallback 进程广播 swap 消息,通知其完成提交过程,创 建真正的"FALLBACK.BUP"文件6. 最终向各个节点的 mnesia_fallback 进程广播 stop 消息,完成变更2. 前半部分 mnesia:create_schema/1 做的工作mnesia.erlcreate_schema(Ns) -> mnesia_bup:create_schema(Ns).mnesia_bup.erlcreate_schema([]) -> create_schema([node()]);create_schema(Ns) when is_list(Ns) -> case is_set(Ns) of true -> create_schema(Ns, mnesia_schema:ensure_no_schema(Ns)); false -> {error, {combine_error, Ns}} end;create_schema(Ns) -> {error, {badarg, Ns}}.mnesia_schema.erlensure_no_schema([H|T]) when is_atom(H) -> case rpc:call(H, ?MODULE, remote_read_schema, []) of {badrpc, Reason} -> {H, {"All nodes not running", H, Reason}}; {ok,Source, _} when Source /= default -> {H, {already_exists, H}}; _ -> ensure_no_schema(T) end;ensure_no_schema([H|_]) -> {error,{badarg, H}};ensure_no_schema([]) -> ok.remote_read_schema() -> case mnesia_lib:ensure_loaded(?APPLICATION) of ok ->
  • 10. case mnesia_monitor:get_env(schema_location) of opt_disc -> read_schema(false); _ -> read_schema(false) end; {error, Reason} -> {error, Reason} end.询问其它所有节点,检查其是否启动,并检查其是否已经具备了 mnesia 的 schema,仅当所有预备建立 mnesia schema 的节点全部启动,且没有 schema 副本,该检查才成立。回到 mnesia_bup.erlmnesia_bup.erlcreate_schema(Ns, ok) -> case mnesia_lib:ensure_loaded(?APPLICATION) of ok -> case mnesia_monitor:get_env(schema_location) of ram -> {error, {has_no_disc, node()}}; _ -> case mnesia_schema:opt_create_dir(true, mnesia_lib:dir()) of {error, What} -> {error, What}; ok -> Mod = mnesia_backup, Str = mk_str(), File = mnesia_lib:dir(Str), file:delete(File), case catch make_initial_backup(Ns, File, Mod) of {ok, _Res} -> case do_install_fallback(File, Mod) of ok -> file:delete(File), ok; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end
  • 11. end end; {error, Reason} -> {error, Reason} end;create_schema(_Ns, {error, Reason}) -> {error, Reason};create_schema(_Ns, Reason) -> {error, Reason}.通过 mnesia_bup:make_initial_backup 创建一个本地节点的新 schema 的描述文件,然后再通过 mnesia_bup:do_install_fallback 将新 schema 描述文件通过恢复过程,变更 schema:make_initial_backup(Ns, Opaque, Mod) -> Orig = mnesia_schema:get_initial_schema(disc_copies, Ns), Modded = proplists:delete(storage_properties, proplists:delete(majority, Orig)), Schema = [{schema, schema, Modded}], O2 = do_apply(Mod, open_write, [Opaque], Opaque), O3 = do_apply(Mod, write, [O2, [mnesia_log:backup_log_header()]], O2), O4 = do_apply(Mod, write, [O3, Schema], O3), O5 = do_apply(Mod, commit_write, [O4], O4), {ok, O5}.创建一个本地节点的新 schema 的描述文件,注意,新 schema 的 majority 属性没有在备份中。mnesia_schema.erlget_initial_schema(SchemaStorage, Nodes) -> Cs = #cstruct{name = schema, record_name = schema, attributes = [table, cstruct]}, Cs2 = case SchemaStorage of ram_copies -> Cs#cstruct{ram_copies = Nodes}; disc_copies -> Cs#cstruct{disc_copies = Nodes} end, cs2list(Cs2).mnesia_bup.erldo_install_fallback(Opaque, Mod) when is_atom(Mod) -> do_install_fallback(Opaque, [{module, Mod}]);do_install_fallback(Opaque, Args) when is_list(Args) -> case check_fallback_args(Args, #fallback_args{opaque = Opaque}) of
  • 12. {ok, FA} -> do_install_fallback(FA); {error, Reason} -> {error, Reason} end;do_install_fallback(_Opaque, Args) -> {error, {badarg, Args}}.检 查 安 装 参 数 , 将 参 数 装 入 一 个 fallback_args 结 构 , 参 数 检 查 及 构 造 过 程 在check_fallback_arg_type/2 中,然后进行安装check_fallback_args([Arg | Tail], FA) -> case catch check_fallback_arg_type(Arg, FA) of {EXIT, _Reason} -> {error, {badarg, Arg}}; FA2 -> check_fallback_args(Tail, FA2) end;check_fallback_args([], FA) -> {ok, FA}.check_fallback_arg_type(Arg, FA) -> case Arg of {scope, global} -> FA#fallback_args{scope = global}; {scope, local} -> FA#fallback_args{scope = local}; {module, Mod} -> Mod2 = mnesia_monitor:do_check_type(backup_module, Mod), FA#fallback_args{module = Mod2}; {mnesia_dir, Dir} -> FA#fallback_args{mnesia_dir = Dir, use_default_dir = false}; {keep_tables, Tabs} -> atom_list(Tabs), FA#fallback_args{keep_tables = Tabs}; {skip_tables, Tabs} -> atom_list(Tabs), FA#fallback_args{skip_tables = Tabs}; {default_op, keep_tables} -> FA#fallback_args{default_op = keep_tables}; {default_op, skip_tables} -> FA#fallback_args{default_op = skip_tables} end.
  • 13. 此处的构造过程记录 module 参数, mnesia_backup, 为 同时记录 opaque 参数, 为新建 schema文件的文件名。do_install_fallback(FA) -> Pid = spawn_link(?MODULE, install_fallback_master, [self(), FA]), Res = receive {EXIT, Pid, Reason} -> % if appl has trapped exit {error, {EXIT, Reason}}; {Pid, Res2} -> case Res2 of {ok, _} -> ok; {error, Reason} -> {error, {"Cannot install fallback", Reason}} end end, Res.install_fallback_master(ClientPid, FA) -> process_flag(trap_exit, true), State = {start, FA}, Opaque = FA#fallback_args.opaque, Mod = FA#fallback_args.module, Res = (catch iterate(Mod, fun restore_recs/4, Opaque, State)), unlink(ClientPid), ClientPid ! {self(), Res}, exit(shutdown).从新建的 schema 文件中迭代恢复到本地节点和全局集群,此时 Mod 为 mnesia_backup,Opaque 为新建 schema 文件的文件名,State 为给出的 fallback_args 参数,均为默认值。fallback_args 默认定义:-record(fallback_args, {opaque, scope = global, module = mnesia_monitor:get_env(backup_module), use_default_dir = true, mnesia_dir, fallback_bup, fallback_tmp, skip_tables = [],
  • 14. keep_tables = [], default_op = keep_tables }).iterate(Mod, Fun, Opaque, Acc) -> R = #restore{bup_module = Mod, bup_data = Opaque}, case catch read_schema_section(R) of {error, Reason} -> {error, Reason}; {R2, {Header, Schema, Rest}} -> case catch iter(R2, Header, Schema, Fun, Acc, Rest) of {ok, R3, Res} -> catch safe_apply(R3, close_read, [R3#restore.bup_data]), {ok, Res}; {error, Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, Reason}; {EXIT, Pid, Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, {EXIT, Pid, Reason}}; {EXIT, Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, {EXIT, Reason}} end end.iter(R, Header, Schema, Fun, Acc, []) -> case safe_apply(R, read, [R#restore.bup_data]) of {R2, []} -> Res = Fun([], Header, Schema, Acc), {ok, R2, Res}; {R2, BupItems} -> iter(R2, Header, Schema, Fun, Acc, BupItems) end;iter(R, Header, Schema, Fun, Acc, BupItems) -> Acc2 = Fun(BupItems, Header, Schema, Acc), iter(R, Header, Schema, Fun, Acc2, []).read_schema_section 将读出新建 schema 文件的内容,得到文件头部,并组装出 schema 结构,将 schema 应用回调函数,此处回调函数为 mnesia_bup 的 restore_recs/4 函数:restore_recs(Recs, Header, Schema, {start, FA}) -> %% No records in backup Schema2 = convert_schema(Header#log_header.log_version, Schema), CreateList = lookup_schema(schema, Schema2),
  • 15. case catch mnesia_schema:list2cs(CreateList) of {EXIT, Reason} -> throw({error, {"Bad schema in restore_recs", Reason}}); Cs -> Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies), global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity), Args = [self(), FA], Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns], send_fallback(Pids, {start, Header, Schema2}), Res = restore_recs(Recs, Header, Schema2, Pids), global:del_lock({{mnesia_table_lock, schema}, self()}, Ns), Res end;一个典型的 schema 结构如下:[{schema,schema, [{name,schema}, {type,set}, {ram_copies,[]}, {disc_copies,[rds_la_dev@10.232.64.77]}, {disc_only_copies,[]}, {load_order,0}, {access_mode,read_write}, {index,[]}, {snmp,[]}, {local_content,false}, {record_name,schema}, {attributes,[table,cstruct]}, {user_properties,[]}, {frag_properties,[]}, {cookie,{{1358,676768,107058},rds_la_dev@10.232.64.77}}, {version,{{2,0},[]}}]}]构成一个{schema, schema, CreateList}的元组,同时调用 mnesia_schema:list2cs(CreateList),将CreateList 还原回 schema 的 cstruct 结构。mnesia_bup.erlrestore_recs(Recs, Header, Schema, {start, FA}) -> %% No records in backup Schema2 = convert_schema(Header#log_header.log_version, Schema), CreateList = lookup_schema(schema, Schema2), case catch mnesia_schema:list2cs(CreateList) of {EXIT, Reason} -> throw({error, {"Bad schema in restore_recs", Reason}});
  • 16. Cs -> Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies), global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity), Args = [self(), FA], Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns], send_fallback(Pids, {start, Header, Schema2}), Res = restore_recs(Recs, Header, Schema2, Pids), global:del_lock({{mnesia_table_lock, schema}, self()}, Ns), Res end;get_fallback_nodes 将得到参与 schema 构建节点的 fallback 节点,通常为所有参与 schema构建的节点。构建过程要加入集群的全局锁{mnesia_table_lock, schema}。在各个参与 schema 构建的节点上,均创建一个 fallback_receiver 进程, 处理 schema 的变更。向这些节点的 fallback_receiver 进程广播{start, Header, Schema2}消息,并等待其返回结果。所有节点的 fallback_receiver 进程对 start 消息响应后,进入下一个过程:restore_recs([], _Header, _Schema, Pids) -> send_fallback(Pids, swap), send_fallback(Pids, stop), stop;restore_recs 向所有节点的 fallback_receiver 进程广播后续的 swap 消息和 stop 消息,完成整个 schema 变更过程,然后释放全局锁{mnesia_table_lock, schema}。进入 fallback_receiver 进程的处理过程:fallback_receiver(Master, FA) -> process_flag(trap_exit, true), case catch register(mnesia_fallback, self()) of {EXIT, _} -> Reason = {already_exists, node()}, local_fallback_error(Master, Reason); true -> FA2 = check_fallback_dir(Master, FA), Bup = FA2#fallback_args.fallback_bup, case mnesia_lib:exists(Bup) of
  • 17. true -> Reason2 = {already_exists, node()}, local_fallback_error(Master, Reason2); false -> Mod = mnesia_backup, Tmp = FA2#fallback_args.fallback_tmp, R = #restore{mode = replace, bup_module = Mod, bup_data = Tmp}, file:delete(Tmp), case catch fallback_receiver_loop(Master, R, FA2, schema) of {error, Reason} -> local_fallback_error(Master, Reason); Other -> exit(Other) end end end.在自身的节点上注册进程名字为 mnesia_fallback。构建初始化状态。进入 fallback_receiver_loop 循环处理来自 schema 变更发起节点的消息。fallback_receiver_loop(Master, R, FA, State) -> receive {Master, {start, Header, Schema}} when State =:= schema -> Dir = FA#fallback_args.mnesia_dir, throw_bad_res(ok, mnesia_schema:opt_create_dir(true, Dir)), R2 = safe_apply(R, open_write, [R#restore.bup_data]), R3 = safe_apply(R2, write, [R2#restore.bup_data, [Header]]), BupSchema = [schema2bup(S) || S <- Schema], R4 = safe_apply(R3, write, [R3#restore.bup_data, BupSchema]), Master ! {self(), ok}, fallback_receiver_loop(Master, R4, FA, records); … end.在本地也创建一个 schema 临时文件, 接收来自变更发起节点构建的 header 部分和新 schema。fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, swap} when State =/= schema -> ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []),
  • 18. safe_apply(R, commit_write, [R#restore.bup_data]), Bup = FA#fallback_args.fallback_bup, Tmp = FA#fallback_args.fallback_tmp, throw_bad_res(ok, file:rename(Tmp, Bup)), catch mnesia_lib:set(active_fallback, true), ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []), Master ! {self(), ok}, fallback_receiver_loop(Master, R, FA, stop); … end.mnesia_backup.erlcommit_write(OpaqueData) -> B = OpaqueData, case disk_log:sync(B#backup.file_desc) of ok -> case disk_log:close(B#backup.file_desc) of ok -> case file:rename(B#backup.tmp_file, B#backup.file) of ok -> {ok, B#backup.file}; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end.变更提交过程,新建的 schema 文件在写入到本节点时,为文件名后跟".BUPTMP"表明是一个临时未提交的文件,此处进行提交时,sync 新建的 schema 文件到磁盘后关闭,并重命名为真正的新建的 schema 文件名,消除最后的".BUPTMP"fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, swap} when State =/= schema -> ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []), safe_apply(R, commit_write, [R#restore.bup_data]), Bup = FA#fallback_args.fallback_bup,
  • 19. Tmp = FA#fallback_args.fallback_tmp, throw_bad_res(ok, file:rename(Tmp, Bup)), catch mnesia_lib:set(active_fallback, true), ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []), Master ! {self(), ok}, fallback_receiver_loop(Master, R, FA, stop); …end.在这个参与节点上,将新建 schema 文件命名为"FALLBACK.BUP",同时激活本地节点的active_fallback 属性,表明称为一个活动 fallback 节点。fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, stop} when State =:= stop -> stopped; … end.收到 stop 消息后,mnesia_fallback 进程退出。3. 后半部分 mnesia:start/0 做的工作mnesia 启 动 , 则 可 以 自 动 通 过 事 务 管 理 器 mnesia_tm 调 用mnesia_bup:tm_fallback_start(IgnoreFallback)将 schema 建立到 dets 表中:mnesia_bup.erltm_fallback_start(IgnoreFallback) -> mnesia_schema:lock_schema(), Res = do_fallback_start(fallback_exists(), IgnoreFallback), mnesia_schema: unlock_schema(), case Res of ok -> ok; {error, Reason} -> exit(Reason) end.锁住 schema 表,然后通过"FALLBACK.BUP"文件进行 schema 恢复创建,最后释放 schema 表锁
  • 20. do_fallback_start(true, false) -> verbose("Starting from fallback...~n", []), BupFile = fallback_bup(), Mod = mnesia_backup, LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]), case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of … end.根据"FALLBACK.BUP"文件,调用 restore_tables 函数进行恢复restore_tables(Recs, Header, Schema, {start, LocalTabs}) -> Dir = mnesia_lib:dir(), OldDir = filename:join([Dir, "OLD_DIR"]), mnesia_schema:purge_dir(OldDir, []), mnesia_schema:purge_dir(Dir, [fallback_name()]), init_dat_files(Schema, LocalTabs), State = {new, LocalTabs}, restore_tables(Recs, Header, Schema, State);init_dat_files(Schema, LocalTabs) -> TmpFile = mnesia_lib:tab2tmp(schema), Args = [{file, TmpFile}, {keypos, 2}, {type, set}], case dets:open_file(schema, Args) of % Assume schema lock {ok, _} -> create_dat_files(Schema, LocalTabs), ok = dets:close(schema), LocalTab = #local_tab{ name = schema, storage_type = disc_copies, open = undefined, add = undefined, close = undefined, swap = undefined, record_name = schema, opened = false}, ?ets_insert(LocalTabs, LocalTab); {error, Reason} -> throw({error, {"Cannot open file", schema, Args, Reason}}) end.创建 schema 的 dets 表,文件名为 schema.TMP,根据"FALLBACK.BUP"文件,将各个表的元数据恢复到新建的 schema 的 dets 表中。
  • 21. 调用 create_dat_files 构建其它表在本节点的元数据信息的 Open/Add/Close/Swap 函数,然后调用之,将其它表的元数据持久化到 schema 表中。restore_tables(Recs, Header, Schema, {start, LocalTabs}) -> Dir = mnesia_lib:dir(), OldDir = filename:join([Dir, "OLD_DIR"]), mnesia_schema:purge_dir(OldDir, []), mnesia_schema:purge_dir(Dir, [fallback_name()]), init_dat_files(Schema, LocalTabs), State = {new, LocalTabs}, restore_tables(Recs, Header, Schema, State);构建其它表在本节点的元数据信息的 Open/Add/Close/Swap 函数restore_tables(All=[Rec | Recs], Header, Schema, {new, LocalTabs}) -> Tab = element(1, Rec), case ?ets_lookup(LocalTabs, Tab) of [] -> State = {not_local, LocalTabs, Tab}, restore_tables(Recs, Header, Schema, State); [LT] when is_record(LT, local_tab) -> State = {local, LocalTabs, LT}, case LT#local_tab.opened of true -> ignore; false -> (LT#local_tab.open)(Tab, LT), ?ets_insert(LocalTabs,LT#local_tab{opened=true}) end, restore_tables(All, Header, Schema, State) end;打开表,不断检查表是否位于本地,若是则进行恢复添加过程:restore_tables(All=[Rec | Recs], Header, Schema, State={local, LocalTabs, LT}) -> Tab = element(1, Rec), if Tab =:= LT#local_tab.name -> Key = element(2, Rec), (LT#local_tab.add)(Tab, Key, Rec, LT), restore_tables(Recs, Header, Schema, State); true -> NewState = {new, LocalTabs}, restore_tables(All, Header, Schema, NewState) end;Add 函数主要为将表记录入 schema 表,此处是写入临时 schema,而未真正提交
  • 22. 待所有表恢复完成后,进行真正的提交工作:do_fallback_start(true, false) -> verbose("Starting from fallback...~n", []), BupFile = fallback_bup(), Mod = mnesia_backup, LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]), case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of {ok, _Res} -> catch dets:close(schema), TmpSchema = mnesia_lib:tab2tmp(schema), DatSchema = mnesia_lib:tab2dat(schema), AllLT = ?ets_match_object(LocalTabs, _), ?ets_delete_table(LocalTabs), case file:rename(TmpSchema, DatSchema) of ok -> [(LT#local_tab.swap)(LT#local_tab.name, LT) || LT <- AllLT, LT#local_tab.name =/= schema], file:delete(BupFile), ok; {error, Reason} -> file:delete(TmpSchema), {error, {"Cannot start from fallback. Rename error.", Reason}} end; {error, Reason} -> {error, {"Cannot start from fallback", Reason}}; {EXIT, Reason} -> {error, {"Cannot start from fallback", Reason}} end.将 schema.TMP 变更为 schema.DAT,正式启用持久 schema,提交 schema 表的变更同 时 调 用在 create_dat_files 函 数 中创 建 的转 换 函数 , 进 行各 个表 的 提交工 作 , 对于ram_copies 表,没有什么额外动作,对于 disc_only_copies 表,主要为提交其对应 dets 表的文件名,对于 disc_copies 表,主要为记录 redo 日志,然后提交其对应 dets 表的文件名。全部完成后,schema 表将成为持久的 dets 表,"FALLBACK.BUP"文件也将被删除。事务管理器在完成 schema 的 dets 表的构建后,将初始化 mnesia_schema:mnesia_schema.erl
  • 23. init(IgnoreFallback) -> Res = read_schema(true, IgnoreFallback), {ok, Source, _CreateList} = exit_on_error(Res), verbose("Schema initiated from: ~p~n", [Source]), set({schema, tables}, []), set({schema, local_tables}, []), Tabs = set_schema(?ets_first(schema)), lists:foreach(fun(Tab) -> clear_whereabouts(Tab) end, Tabs), set({schema, where_to_read}, node()), set({schema, load_node}, node()), set({schema, load_reason}, initial), mnesia_controller:add_active_replica(schema, node()).检查 schema 表从何处恢复,在 mnesia_gvar 这个全局状态 ets 表中,初始化 schema 的原始信息,并将本节点作为 schema 表的初始活动副本若某个节点作为一个表的活动副本,则表的 where_to_commit 和 where_to_write 属性必须同时包含该节点。4. mnesia:change_table_majority/2 的工作过程mnesia 表可以在建立时,设置一个 majority 的属性,也可以在建立表之后通过 mnesia:change_table_majority/2 更改此属性。该属性可以要求 mnesia 在进行事务时,检查所有参与事务的节点是否为表的提交节点的大多数,这样可以在出现网络分区时,保证 majority 节点的可用性,同时也能保证整个网络的一致性,minority 节点将不可用,这也是 CAP 理论的一个折中。1. 调用接口mnesia.erlchange_table_majority(T, M) -> mnesia_schema:change_table_majority(T, M).
  • 24. mnesia_schema.erlchange_table_majority(Tab, Majority) when is_boolean(Majority) -> schema_transaction(fun() -> do_change_table_majority(Tab, Majority) end).schema_transaction(Fun) -> case get(mnesia_activity_state) of undefined -> Args = [self(), Fun, whereis(mnesia_controller)], Pid = spawn_link(?MODULE, schema_coordinator, Args), receive {transaction_done, Res, Pid} -> Res; {EXIT, Pid, R} -> {aborted, {transaction_crashed, R}} end; _ -> {aborted, nested_transaction} end.启动一个 schema 事务的协调者 schema_coordinator 进程。schema_coordinator(Client, Fun, Controller) when is_pid(Controller) -> link(Controller), unlink(Client), Res = mnesia:transaction(Fun), Client ! {transaction_done, Res, self()}, unlink(Controller), % Avoids spurious exit message unlink(whereis(mnesia_tm)), % Avoids spurious exit message exit(normal).与普通事务不同, schema 事务使用的 schema_coordinator 进程 link 到的不是请求者的进程,而是 mnesia_controller 进程。启动一个 mnesia 事务,函数为 fun() -> do_change_table_majority(Tab, Majority) end。2. 事务操作do_change_table_majority(schema, _Majority) -> mnesia:abort({bad_type, schema});do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).
  • 25. 可以看出,不能修改 schema 表的 majority 属性。对 schema 表主动申请写锁,而不对需要更改 majority 属性的表申请锁get_tid_ts_and_lock(Tab, Intent) -> TidTs = get(mnesia_activity_state), case TidTs of {_Mod, Tid, Ts} when is_record(Ts, tidstore)-> Store = Ts#tidstore.store, case Intent of read -> mnesia_locker:rlock_table(Tid, Store, Tab); write -> mnesia_locker:wlock_table(Tid, Store, Tab); none -> ignore end, TidTs; _ -> mnesia:abort(no_transaction) end.上锁的过程:直接向锁管理器 mnesia_locker 请求表锁。do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).关注实际的 majority 属性的修改动作:make_change_table_majority(Tab, Majority) -> ensure_writable(schema), Cs = incr_version(val({Tab, cstruct})), ensure_active(Cs), OldMajority = Cs#cstruct.majority, Cs2 = Cs#cstruct{majority = Majority}, FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of {_, Tab} -> FragNames = mnesia_frag:frag_names(Tab) -- [Tab], lists:map( fun(T) -> get_tid_ts_and_lock(Tab, none), CsT = incr_version(val({T, cstruct})), ensure_active(CsT), CsT2 = CsT#cstruct{majority = Majority}, verify_cstruct(CsT2), {op, change_table_majority, vsn_cs2list(CsT2), OldMajority, Majority}
  • 26. end, FragNames); false -> []; {_, _} -> mnesia:abort({bad_type, Tab}) end, verify_cstruct(Cs2), [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].通过 ensure_writable 检查 schema 表的 where_to_write 属性是否为[],即是否有持久化的schema 节点。通过 incr_version 更新表的版本号。通过 ensure_active 检查所有表的副本节点是否存活, 即与副本节点进行表的全局视图确认。修改表的元数据版本号:incr_version(Cs) -> {{Major, Minor}, _} = Cs#cstruct.version, Nodes = mnesia_lib:intersect(val({schema, disc_copies}), mnesia_lib:cs_to_nodes(Cs)), V= case Nodes -- val({Cs#cstruct.name, active_replicas}) of [] -> {Major + 1, 0}; % All replicas are active _ -> {Major, Minor + 1} % Some replicas are inactive end, Cs#cstruct{version = {V, {node(), now()}}}.mnesia_lib.erlcs_to_nodes(Cs) -> Cs#cstruct.disc_only_copies ++ Cs#cstruct.disc_copies ++ Cs#cstruct.ram_copies.重新计算表的元数据版本号,由于这是一个 schema 表的变更,需要参考有持久 schema 的节点以及持有该表副本的节点的信息而计算表的版本号,若这二类节点的交集全部存活,则主版本可以增加,否则仅能增加副版本,同时为表的 cstruct 结构生成一个新的版本描述符,这个版本描述符包括三个部分:{新的版本号,{发起变更的节点,发起变更的时间}},相当于时空序列+单调递增序列。版本号的计算类似于 NDB。检查表的全局视图:
  • 27. ensure_active(Cs) -> ensure_active(Cs, active_replicas).ensure_active(Cs, What) -> Tab = Cs#cstruct.name, W = {Tab, What}, ensure_non_empty(W), Nodes = mnesia_lib:intersect(val({schema, disc_copies}), mnesia_lib:cs_to_nodes(Cs)), case Nodes -- val(W) of [] -> ok; Ns -> Expl = "All replicas on diskfull nodes are not active yet", case val({Tab, local_content}) of true -> case rpc:multicall(Ns, ?MODULE, is_remote_member, [W]) of {Replies, []} -> check_active(Replies, Expl, Tab); {_Replies, BadNs} -> mnesia:abort({not_active, Expl, Tab, BadNs}) end; false -> mnesia:abort({not_active, Expl, Tab, Ns}) end end.is_remote_member(Key) -> IsActive = lists:member(node(), val(Key)), {IsActive, node()}.为了防止不一致的状态,需要向这样的未明节点进行确认:该节点不是表的活动副本节点,却是表的副本节点,也是持久 schema 节点。确认的内容是:通过 is_remote_member 询问该节点,其是否已经是该表的活动副本节点。这样就避免了未明节点与请求节点对改变状态的不一致认知。make_change_table_majority(Tab, Majority) -> ensure_writable(schema), Cs = incr_version(val({Tab, cstruct})), ensure_active(Cs), OldMajority = Cs#cstruct.majority, Cs2 = Cs#cstruct{majority = Majority}, FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of {_, Tab} ->
  • 28. FragNames = mnesia_frag:frag_names(Tab) -- [Tab], lists:map( fun(T) -> get_tid_ts_and_lock(Tab, none), CsT = incr_version(val({T, cstruct})), ensure_active(CsT), CsT2 = CsT#cstruct{majority = Majority}, verify_cstruct(CsT2), {op, change_table_majority, vsn_cs2list(CsT2), OldMajority, Majority} end, FragNames); false -> []; {_, _} -> mnesia:abort({bad_type, Tab}) end, verify_cstruct(Cs2), [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].变更表的 cstruct 中对 majority 属性的记录,检查表的新建 cstruct,主要检查 cstruct 的各项成员的类型,内容是否合乎要求。vsn_cs2list 将 cstruct 转换为一个 proplist,key 为 record 成员名,value 为 record 成员值。生成一个 change_table_majority 动作,供给 insert_schema_ops 使用。do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).此时 make_change_table_majority 生成的动作为[{op, change_taboe_majority, 表的新 cstruct组成的 proplist, OldMajority, Majority}]insert_schema_ops({_Mod, _Tid, Ts}, SchemaIOps) -> do_insert_schema_ops(Ts#tidstore.store, SchemaIOps).do_insert_schema_ops(Store, [Head | Tail]) -> ?ets_insert(Store, Head), do_insert_schema_ops(Store, Tail);do_insert_schema_ops(_Store, []) -> ok.可以看到, 插入过程仅仅将 make_change_table_majority 操作记入当前事务的临时 ets 表中。这个临时插入动作完成后,mnesia 将开始执行提交过程,与普通表事务不同,由于操作是
  • 29. op 开头,表明这是一个 schema 事务,事务管理器需要额外的处理,使用不同的事务提交过程。3. schema 事务提交接口mnesia_tm.erlt_commit(Type) -> {_Mod, Tid, Ts} = get(mnesia_activity_state), Store = Ts#tidstore.store, if Ts#tidstore.level == 1 -> intercept_friends(Tid, Ts), case arrange(Tid, Store, Type) of {N, Prep} when N > 0 -> multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store); {0, Prep} -> multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store) end; true -> %% nested commit Level = Ts#tidstore.level, [{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores, req({del_store, Tid, Store, Obsolete, false}), NewTs = Ts#tidstore{store = Store, up_stores = Tail, level = Level - 1}, NewTidTs = {OldMod, Tid, NewTs}, put(mnesia_activity_state, NewTidTs), do_commit_nested end.首先在操作重排时进行检查:arrange(Tid, Store, Type) -> %% The local node is always included Nodes = get_elements(nodes,Store), Recs = prep_recs(Nodes, []), Key = ?ets_first(Store), N = 0, Prep = case Type of async -> #prep{protocol = sym_trans, records = Recs};
  • 30. sync -> #prep{protocol = sync_sym_trans, records = Recs} end, case catch do_arrange(Tid, Store, Key, Prep, N) of {EXIT, Reason} -> dbg_out("do_arrange failed ~p ~p~n", [Reason, Tid]), case Reason of {aborted, R} -> mnesia:abort(R); _ -> mnesia:abort(Reason) end; {New, Prepared} -> {New, Prepared#prep{records = reverse(Prepared#prep.records)}} end.Key 参数即为插入临时 ets 表的第一个操作,此处将为 op。do_arrange(Tid, Store, {Tab, Key}, Prep, N) -> Oid = {Tab, Key}, Items = ?ets_lookup(Store, Oid), %% Store is a bag P2 = prepare_items(Tid, Tab, Key, Items, Prep), do_arrange(Tid, Store, ?ets_next(Store, Oid), P2, N + 1);do_arrange(Tid, Store, SchemaKey, Prep, N) when SchemaKey == op -> Items = ?ets_lookup(Store, SchemaKey), %% Store is a bag P2 = prepare_schema_items(Tid, Items, Prep), do_arrange(Tid, Store, ?ets_next(Store, SchemaKey), P2, N + 1);可以看出,普通表的 key 为{Tab, Key},而 schema 表的 key 为 op,取得的 Itens 为[{op,change_taboe_majority, 表的新 cstruct 组成的 proplist, OldMajority, Majority}],这导致本次事务使用不同的提交协议:prepare_schema_items(Tid, Items, Prep) -> Types = [{N, schema_ops} || N <- val({current, db_nodes})], Recs = prepare_nodes(Tid, Types, Items, Prep#prep.records, schema), Prep#prep{protocol = asym_trans, records = Recs}.prepare_node 在 Recs 的 schema_ops 成员中记录 schema 表的操作,同时将表的提交协议设置为 asym_trans。prepare_node(_Node, _Storage, Items, Rec, Kind) when Kind == schema, Rec#commit.schema_ops == [] -> Rec#commit{schema_ops = Items};t_commit(Type) ->
  • 31. {_Mod, Tid, Ts} = get(mnesia_activity_state), Store = Ts#tidstore.store, if Ts#tidstore.level == 1 -> intercept_friends(Tid, Ts), case arrange(Tid, Store, Type) of {N, Prep} when N > 0 -> multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store); {0, Prep} -> multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store) end; true -> %% nested commit Level = Ts#tidstore.level, [{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores, req({del_store, Tid, Store, Obsolete, false}), NewTs = Ts#tidstore{store = Store, up_stores = Tail, level = Level - 1}, NewTidTs = {OldMod, Tid, NewTs}, put(mnesia_activity_state, NewTidTs), do_commit_nested end.提交过程使用 asym_trans,这个协议主要用于:schema 操作, majority 属性的表的操作, 有recover_coordinator 过程,restore_op 操作。4. schema 事务协议过程multi_commit(asym_trans, Majority, Tid, CR, Store) -> D = #decision{tid = Tid, outcome = presume_abort}, {D2, CR2} = commit_decision(D, CR, [], []), DiscNs = D2#decision.disc_nodes, RamNs = D2#decision.ram_nodes, case have_majority(Majority, DiscNs ++ RamNs) of ok -> ok; {error, Tab} -> mnesia:abort({no_majority, Tab}) end, Pending = mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs), ?ets_insert(Store, Pending), {WaitFor, Local} = ask_commit(asym_trans, Tid, CR2, DiscNs, RamNs),
  • 32. SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})), {Votes, Pids} = rec_all(WaitFor, Tid, do_commit, []), ?eval_debug_fun({?MODULE, multi_commit_asym_got_votes}, [{tid, Tid}, {votes, Votes}]), case Votes of do_commit -> case SchemaPrep of {_Modified, C = #commit{}, DumperMode} -> mnesia_log:log(C), % C is not a binary ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_rec}, [{tid, Tid}]), D3 = C#commit.decision, D4 = D3#decision{outcome = unclear}, mnesia_recover:log_decision(D4), ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_dec}, [{tid, Tid}]), tell_participants(Pids, {Tid, pre_commit}), rec_acc_pre_commit(Pids, Tid, Store, {C,Local}, do_commit, DumperMode, [], []); {EXIT, Reason} -> mnesia_recover:note_decision(Tid, aborted), ?eval_debug_fun({?MODULE, multi_commit_asym_prepare_exit}, [{tid, Tid}]), tell_participants(Pids, {Tid, {do_abort, Reason}}), do_abort(Tid, Local), {do_abort, Reason} end; {do_abort, Reason} -> mnesia_recover:note_decision(Tid, aborted), ?eval_debug_fun({?MODULE, multi_commit_asym_do_abort}, [{tid, Tid}]), tell_participants(Pids, {Tid, {do_abort, Reason}}), do_abort(Tid, Local), {do_abort, Reason} end.事务处理过程从 mnesia_tm:t_commit/1 开始,流程如下:1. 发起节点检查 majority 条件是否满足,即表的存活副本节点数必须大于表的磁盘和内存 副本节点数的一半,等于一半时亦不满足2. 发起节点操作发起节点调用 mnesia_checkpoint:tm_enter_pending,产生检查点3. 发起节点向各个参与节点的事务管理器发起第一阶段提交过程 ask_commit,注意此时协 议类型为 asym_trans4. 参与节点事务管理器创建一个 commit_participant 进程,该进程将进行负责接下来的提
  • 33. 交过程 注意 majority 表和 schema 表操作,需要额外创建一个进程辅助提交,可能导致性能变 低5. 参 与 节 点 commit_participant 进 程 进 行 本 地 schema 操 作 的 prepare 过 程 , 对 于 change_table_majority,没有什么需要 prepare 的6. 参与节点 commit_participant 进程同意提交,向发起节点返回 vote_yes7. 发起节点收到所有参与节点的同意提交消息8. 发起节点进行本地 schema 操作的 prepare 过程,对于 change_table_majority,同样没有 什么需要 prepare 的9. 发起节点收到所有参与节点的 vote_yes 后,记录需要提交的操作的日志10. 发起节点记录第一阶段恢复日志 presume_abort;11. 发起节点记录第二阶段恢复日志 unclear12. 发起节点向各个参与节点的 commit_participant 进程发起第二阶段提交过程 pre_commit13. 参与节点 commit_participant 进程收到 pre_commit 进行预提交14. 参与节点记录第一阶段恢复日志 presume_abort15. 参与节点记录第二阶段恢复日志 unclear16. 参与节点 commit_participant 进程同意预提交,向发起节点返回 acc_pre_commit17. 发起节点收到所有参与节点的 acc_pre_commit 后,记录需要等待的 schema 操作参与节 点,用于崩溃恢复过程18. 发起节点向各个参与节点的 commit_participant 进程发起第三阶段提交过程 committed19. a.发起节点通知完参与节点进行 committed 后,立即记录第二阶段恢复日志 committed b.参与节点 commit_participant 进程收到 committed 后进行提交,立即记录第二阶段恢复
  • 34. 日志 committed20. a.发起节点记录完第二阶段恢复日志后,进行本地提交,通过 do_commit 完成 b.参与节点 commit_participant 进程记录完第二阶段恢复日志后,进行本地提交,通过 do_commit 完成21. a.发起节点本地提交完成后,若有 schema 操作,则同步等待参与节点 commit_participant 进程的 schema 操作的提交结果 b.参与节点 commit_participant 进程本地提交完成后,若有 schema 操作,则向发起节点 返回 schema_commit22. a.发起节点本地收到所有参与节点的 schema_commit 后,释放锁和事务资源 b.参与节点 commit_participant 进程释放锁和事务资源5. 远程节点事务管理器第一阶段提交 prepare 响应参与节点事务管理器收到第一阶段提交的消息后:mnesia.erldoit_loop(#state{coordinators=Coordinators,participants=Participants,supervisor=Sup}=State) ->… {From, {ask_commit, Protocol, Tid, Commit, DiscNs, RamNs}} -> ?eval_debug_fun({?MODULE, doit_ask_commit}, [{tid, Tid}, {prot, Protocol}]), mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs), Pid = case Protocol of asym_trans when node(Tid#tid.pid) /= node() -> Args = [tmpid(From), Tid, Commit, DiscNs, RamNs], spawn_link(?MODULE, commit_participant, Args); _ when node(Tid#tid.pid) /= node() -> %% *_sym_trans reply(From, {vote_yes, Tid}), nopid end, P = #participant{tid = Tid,
  • 35. pid = Pid, commit = Commit, disc_nodes = DiscNs, ram_nodes = RamNs, protocol = Protocol}, State2 = State#state{participants = gb_trees:insert(Tid,P,Participants)}, doit_loop(State2);…创建一个 commit_participant 进程,参数包括[发起节点的进程号,事务 id,提交内容,磁盘节点列表,内存节点列表],辅助事务提交:commit_participant(Coord, Tid, Bin, DiscNs, RamNs) when is_binary(Bin) -> process_flag(trap_exit, true), Commit = binary_to_term(Bin), commit_participant(Coord, Tid, Bin, Commit, DiscNs, RamNs);commit_participant(Coord, Tid, C = #commit{}, DiscNs, RamNs) -> process_flag(trap_exit, true), commit_participant(Coord, Tid, C, C, DiscNs, RamNs).commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) -> ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]), case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of {Modified, C = #commit{}, DumperMode} ->、 case lists:member(node(), DiscNs) of false -> ignore; true -> case Modified of false -> mnesia_log:log(Bin); true -> mnesia_log:log(C) end end, ?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]), reply(Coord, {vote_yes, Tid, self()}), …参与节点的 commit_participant 进程在创建初期,需要在本地进行 schema 表的 prepare 工作:mnesia_schema.erlprepare_commit(Tid, Commit, WaitFor) -> case Commit#commit.schema_ops of [] -> {false, Commit, optional};
  • 36. OrigOps -> {Modified, Ops, DumperMode} = prepare_ops(Tid, OrigOps, WaitFor, false, [], optional), InitBy = schema_prepare, GoodRes = {Modified, Commit#commit{schema_ops = lists:reverse(Ops)}, DumperMode}, case DumperMode of optional -> dbg_out("Transaction log dump skipped (~p): ~w~n", [DumperMode, InitBy]); mandatory -> case mnesia_controller:sync_dump_log(InitBy) of dumped -> GoodRes; {error, Reason} -> mnesia:abort(Reason) end end, case Ops of [] -> ignore; _ -> mnesia_controller:wait_for_schema_commit_lock() end, GoodRes end.注意此处,包含三个主要分支:1. 若操作中不包含任何 schema 操作,则不进行任何动作,仅返回{false, 原 Commit 内容, optional},这适用于 majority 类表的操作2. 若操作通过 prepare_ops 判定后,如果包含这些操作:rec,announce_im_running, sync_trans , create_table , delete_table , add_table_copy , del_table_copy , change_table_copy_type,dump_table,add_snmp,transform,merge_schema,有可能 但不一定需要进行 prepare,prepare 动作包括各类操作自身的一些内容记录,以及 sync 日志,这适用于出现上述操作的时候3. 若操作通过 prepare_ops 判定后,仅包含其它类型的操作,则不作任何动作,仅返回{true, 原 Commit 内容, optional},这适用于较小的 schema 操作,此处的 change_table_majority 就属于这类操作
  • 37. 6. 远程节点事务参与者第二阶段提交 precommit 响应commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->… receive {Tid, pre_commit} -> D = C#commit.decision, mnesia_recover:log_decision(D#decision{outcome = unclear}), ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]), Expect_schema_ack = C#commit.schema_ops /= [], reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}), receive {Tid, committed} -> mnesia_recover:log_decision(D#decision{outcome = committed}), ?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]), do_commit(Tid, C, DumperMode), case Expect_schema_ack of false -> ignore; true -> reply(Coord, {schema_commit, Tid, self()}) end, ?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]); … end;…参与节点的 commit_participant 进程收到预提交消息后,同样记录第二阶段恢复日志 unclear,并返回 acc_pre_commit7. 请求节点事务发起者收到第二阶段提交 precommit 确认发起节点收到所有参与节点的 acc_pre_commit 消息后:rec_acc_pre_commit([], Tid, Store, {Commit,OrigC}, Res, DumperMode, GoodPids,SchemaAckPids) -> D = Commit#commit.decision, case Res of do_commit ->
  • 38. prepare_sync_schema_commit(Store, SchemaAckPids), tell_participants(GoodPids, {Tid, committed}), D2 = D#decision{outcome = committed}, mnesia_recover:log_decision(D2), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_commit}, [{tid, Tid}]), do_commit(Tid, Commit, DumperMode), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_commit}, [{tid, Tid}]), sync_schema_commit(Tid, Store, SchemaAckPids), mnesia_locker:release_tid(Tid), ?MODULE ! {delete_transaction, Tid}; {do_abort, Reason} -> tell_participants(GoodPids, {Tid, {do_abort, Reason}}), D2 = D#decision{outcome = aborted}, mnesia_recover:log_decision(D2), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_abort}, [{tid, Tid}]), do_abort(Tid, OrigC), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_abort}, [{tid, Tid}]) end, Res.prepare_sync_schema_commit(_Store, []) -> ok;prepare_sync_schema_commit(Store, [Pid | Pids]) -> ?ets_insert(Store, {waiting_for_commit_ack, node(Pid)}), prepare_sync_schema_commit(Store, Pids).发起节点在本地记录参与 schema 操作的节点,用于崩溃恢复过程,然后向所有参与节点commit_participant 进程发送 committed,通知其进行最终提交,此时发起节点可以进行本地提交,记录第二阶段恢复日志 committed,本地提交通过 do_commit 完成,然后同步等待参与节点的 schema 操作提交结果,若没有 schema 操作,则可以立即返回,此处需要等待:sync_schema_commit(_Tid, _Store, []) -> ok;sync_schema_commit(Tid, Store, [Pid | Tail]) -> receive {?MODULE, _, {schema_commit, Tid, Pid}} -> ?ets_match_delete(Store, {waiting_for_commit_ack, node(Pid)}), sync_schema_commit(Tid, Store, Tail); {mnesia_down, Node} when Node == node(Pid) -> ?ets_match_delete(Store, {waiting_for_commit_ack, Node}), sync_schema_commit(Tid, Store, Tail) end.
  • 39. 8. 远程节点事务参与者第三阶段提交 commit 响应参与节点 commit_participant 进程收到提交消息后:commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->… receive {Tid, pre_commit} -> D = C#commit.decision, mnesia_recover:log_decision(D#decision{outcome = unclear}), ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]), Expect_schema_ack = C#commit.schema_ops /= [], reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}), receive {Tid, committed} -> mnesia_recover:log_decision(D#decision{outcome = committed}), ?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]), do_commit(Tid, C, DumperMode), case Expect_schema_ack of false -> ignore; true -> reply(Coord, {schema_commit, Tid, self()}) end, ?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]); … end;…参与节点的 commit_participant 进程收到预提交消息后,同样记录第而阶段恢复日志committed,通过 do_commit 进行本地提交后,若有 schema 操作,则向发起节点返回schema_commit,否则完成事务。9. 第三阶段提交 commit 的本地提交过程do_commit(Tid, C, DumperMode) -> mnesia_dumper:update(Tid, C#commit.schema_ops, DumperMode), R = do_snmp(Tid, C#commit.snmp),
  • 40. R2 = do_update(Tid, ram_copies, C#commit.ram_copies, R), R3 = do_update(Tid, disc_copies, C#commit.disc_copies, R2), R4 = do_update(Tid, disc_only_copies, C#commit.disc_only_copies, R3), mnesia_subscr:report_activity(Tid), R4.这里仅关注对于 schema 表的更新,同时需要注意,这些更新操作会同时发生在发起节点与参与节点中。对于 schema 表的更新包括:1. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 majority 属性为设置的值,同时更 新表的 where_to_wlock 属性2. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 cstruct,并记录由 cstruct 导出的各 个属性3. 在 schema 的 ets 表中,记录表的 cstruct4. 在 schema 的 dets 表中,记录表的 cstruct更新过程如下:mnesia_dumper.erlupdate(_Tid, [], _DumperMode) -> dumped;update(Tid, SchemaOps, DumperMode) -> UseDir = mnesia_monitor:use_dir(), Res = perform_update(Tid, SchemaOps, DumperMode, UseDir), mnesia_controller:release_schema_commit_lock(), Res.perform_update(_Tid, _SchemaOps, mandatory, true) -> InitBy = schema_update, ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]), opt_dump_log(InitBy);perform_update(Tid, SchemaOps, _DumperMode, _UseDir) -> InitBy = fast_schema_update, InPlace = mnesia_monitor:get_env(dump_log_update_in_place), ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]), case catch insert_ops(Tid, schema_ops, SchemaOps, InPlace, InitBy,
  • 41. mnesia_log:version()) of {EXIT, Reason} -> Error = {error, {"Schema update error", Reason}}, close_files(InPlace, Error, InitBy), fatal("Schema update error ~p ~p", [Reason, SchemaOps]); _ -> ?eval_debug_fun({?MODULE, post_dump}, [InitBy]), close_files(InPlace, ok, InitBy), ok end.insert_ops(_Tid, _Storage, [], _InPlace, _InitBy, _) -> ok;insert_ops(Tid, Storage, [Op], InPlace, InitBy, Ver) when Ver >= "4.3"-> insert_op(Tid, Storage, Op, InPlace, InitBy), ok;insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver >= "4.3"-> insert_op(Tid, Storage, Op, InPlace, InitBy), insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver);insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver < "4.3" -> insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver), insert_op(Tid, Storage, Op, InPlace, InitBy).…insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) -> Cs = mnesia_schema:list2cs(TabDef), case InitBy of startup -> ignore; _ -> mnesia_controller:change_table_majority(Cs) end, insert_cstruct(Tid, Cs, true, InPlace, InitBy);…对于 change_table_majority 操作,其本身的格式为:{op, change_taboe_majority, 表的新 cstruct 组成的 proplist, OldMajority, Majority}此处将 proplist 形态的 cstruct 转换为 record 形态的 cstruct,然进行真正的设置mnesia_controller.erlchange_table_majority(Cs) -> W = fun() -> Tab = Cs#cstruct.name, set({Tab, majority}, Cs#cstruct.majority), update_where_to_wlock(Tab)
  • 42. end, update(W).update_where_to_wlock(Tab) -> WNodes = val({Tab, where_to_write}), Majority = case catch val({Tab, majority}) of true -> true; _ -> false end, set({Tab, where_to_wlock}, {WNodes, Majority}).该处做的更新主要为:在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 majority 属性为设置的值,同时更新表的 where_to_wlock 属性,重设 majority 部分mnesia_dumper.erl…insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) -> Cs = mnesia_schema:list2cs(TabDef), case InitBy of startup -> ignore; _ -> mnesia_controller:change_table_majority(Cs) end, insert_cstruct(Tid, Cs, true, InPlace, InitBy);…insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) -> Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts), {schema, Tab, _} = Val, S = val({schema, storage_type}), disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy), Tab.除了在 mnesia 全局变量 ets 表 mnesia_gvar 中更新表的 where_to_wlock 属性外,还要更新其 cstruct 属性,及由此属性导出的其它属性,另外,还需要更新 schema 的 ets 表中记录的表的 cstructmnesia_schema.erlinsert_cstruct(Tid, Cs, KeepWhereabouts) -> Tab = Cs#cstruct.name, TabDef = cs2list(Cs), Val = {schema, Tab, TabDef}, mnesia_checkpoint:tm_retain(Tid, schema, Tab, write), mnesia_subscr:report_table_event(schema, Tid, Val, write), Active = val({Tab, active_replicas}),
  • 43. case KeepWhereabouts of true -> ignore; false when Active == [] -> clear_whereabouts(Tab); false -> ignore end, set({Tab, cstruct}, Cs), ?ets_insert(schema, Val), do_set_schema(Tab, Cs), Val.do_set_schema(Tab) -> List = get_create_list(Tab), Cs = list2cs(List), do_set_schema(Tab, Cs).do_set_schema(Tab, Cs) -> Type = Cs#cstruct.type, set({Tab, setorbag}, Type), set({Tab, local_content}, Cs#cstruct.local_content), set({Tab, ram_copies}, Cs#cstruct.ram_copies), set({Tab, disc_copies}, Cs#cstruct.disc_copies), set({Tab, disc_only_copies}, Cs#cstruct.disc_only_copies), set({Tab, load_order}, Cs#cstruct.load_order), set({Tab, access_mode}, Cs#cstruct.access_mode), set({Tab, majority}, Cs#cstruct.majority), set({Tab, all_nodes}, mnesia_lib:cs_to_nodes(Cs)), set({Tab, snmp}, Cs#cstruct.snmp), set({Tab, user_properties}, Cs#cstruct.user_properties), [set({Tab, user_property, element(1, P)}, P) || P <- Cs#cstruct.user_properties], set({Tab, frag_properties}, Cs#cstruct.frag_properties), mnesia_frag:set_frag_hash(Tab, Cs#cstruct.frag_properties), set({Tab, storage_properties}, Cs#cstruct.storage_properties), set({Tab, attributes}, Cs#cstruct.attributes), Arity = length(Cs#cstruct.attributes) + 1, set({Tab, arity}, Arity), RecName = Cs#cstruct.record_name, set({Tab, record_name}, RecName), set({Tab, record_validation}, {RecName, Arity, Type}), set({Tab, wild_pattern}, wild(RecName, Arity)), set({Tab, index}, Cs#cstruct.index), %% create actual index tabs later set({Tab, cookie}, Cs#cstruct.cookie), set({Tab, version}, Cs#cstruct.version), set({Tab, cstruct}, Cs), Storage = mnesia_lib:schema_cs_to_storage_type(node(), Cs), set({Tab, storage_type}, Storage),
  • 44. mnesia_lib:add({schema, tables}, Tab), Ns = mnesia_lib:cs_to_nodes(Cs), case lists:member(node(), Ns) of true -> mnesia_lib:add({schema, local_tables}, Tab); false when Tab == schema -> mnesia_lib:add({schema, local_tables}, Tab); false -> ignore end.do_set_schema 更新由 cstruct 导出的各项属性,如版本,cookie 等mnesia_dumper.erlinsert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) -> Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts), {schema, Tab, _} = Val, S = val({schema, storage_type}), disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy), Tab.disc_insert(_Tid, Storage, Tab, Key, Val, Op, InPlace, InitBy) -> case open_files(Tab, Storage, InPlace, InitBy) of true -> case Storage of disc_copies when Tab /= schema -> mnesia_log:append({?MODULE,Tab}, {{Tab, Key}, Val, Op}), ok; _ -> dets_insert(Op,Tab,Key,Val) end; false -> ignore end.dets_insert(Op,Tab,Key,Val) -> case Op of write -> dets_updated(Tab,Key), ok = dets:insert(Tab, Val); … end.dets_updated(Tab,Key) -> case get(mnesia_dumper_dets) of undefined -> Empty = gb_trees:empty(),
  • 45. Tree = gb_trees:insert(Tab, gb_sets:singleton(Key), Empty), put(mnesia_dumper_dets, Tree); Tree -> case gb_trees:lookup(Tab,Tree) of {value, cleared} -> ignore; {value, Set} -> T = gb_trees:update(Tab, gb_sets:add(Key, Set), Tree), put(mnesia_dumper_dets, T); none -> T = gb_trees:insert(Tab, gb_sets:singleton(Key), Tree), put(mnesia_dumper_dets, T) end end.更新 schema 的 dets 表中记录的表 cstruct。综上所述,对于 schema 表的变更,或者 majority 类的表,其事务提交过程为三阶段,同时有良好的崩溃恢复检测schema 表的变更包括对多处地方的更新,包括:1. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 xxx 属性为设置的值2. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 cstruct,并记录由 cstruct 导出的各 个属性3. 在 schema 的 ets 表中,记录表的 cstruct4. 在 schema 的 dets 表中,记录表的 cstruct5. majority 事务处理majority 事务总体与 schema 事务处理过程相同,只是在 mnesia_tm:multi_commit 的提交过程中,不调用 mnesia_schema:prepare_commit/3、mnesia_tm:prepare_sync_schema_commit/2修改 schema 表,也不调用 mnesia_tm:sync_schema_commit 等待第三阶段同步提交完成。
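To connect this back to usage: the sketch below enables majority on a table and performs a write transaction; while the local node can only reach a minority of the table's replicas, the transaction aborts with {no_majority, account}, which is the have_majority check at the top of mnesia_tm:multi_commit shown earlier in the schema transaction protocol subsection. The module, table and record names are illustrative assumptions, not mnesia code.

-module(majority_demo).
-export([setup/1, deposit/2]).

-record(account, {id, balance = 0}).

%% Enable majority on a disc_copies table. The flag can also be given as
%% {majority, true} in the create_table options; changing it afterwards is
%% the schema transaction analysed above.
setup(Nodes) ->
    {atomic, ok} = mnesia:create_table(account,
                                       [{disc_copies, Nodes},
                                        {attributes, record_info(fields, account)}]),
    {atomic, ok} = mnesia:change_table_majority(account, true),
    ok.

%% A write transaction against a majority table. In a minority partition
%% the commit is refused and this returns {aborted, {no_majority, account}}.
deposit(Id, Amount) ->
    mnesia:transaction(
      fun() ->
              Balance = case mnesia:read(account, Id, write) of
                            [#account{balance = B}] -> B;
                            []                      -> 0
                        end,
              mnesia:write(#account{id = Id, balance = Balance + Amount})
      end).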
  • 46. 6. 恢复mnesia 的连接协商过程用于在启动时,结点间交互状态信息:整个协商包括如下过程:1. 节点发现,集群遍历2. 节点协议版本检查3. 节点 schema 合并4. 节点 decision 通告与合并5. 节点数据重新载入与合并1. 节点协议版本检查+节点 decision 通告与合并mnesia_recover.erlconnect_nodes(Ns) -> %%Ns 为要检查的节点 call({connect_nodes, Ns}).handle_call({connect_nodes, Ns}, From, State) -> %% Determine which nodes we should try to connect AlreadyConnected = val(recover_nodes), {_, Nodes} = mnesia_lib:search_delete(node(), Ns), Check = Nodes -- AlreadyConnected, %%开始版本协商 case mnesia_monitor:negotiate_protocol(Check) of busy -> %% monitor is disconnecting some nodes retry %% the req (to avoid deadlock). erlang:send_after(2, self(), {connect_nodes,Ns,From}), {noreply, State}; [] -> %% No good noodes to connect to! %% We cant use reply here because this function can be
  • 47. %% called from handle_info gen_server:reply(From, {[], AlreadyConnected}), {noreply, State}; GoodNodes -> %% GoodNodes 是协商通过的节点 %% Now we have agreed upon a protocol with some new nodes %% and we may use them when we recover transactions mnesia_lib:add_list(recover_nodes, GoodNodes), %%协议版本协商通过后,告知这些节点本节点曾经的历史事务 decision cast({announce_all, GoodNodes}), case get_master_nodes(schema) of [] -> Context = starting_partitioned_network, %%检查曾经是否与这些节点出现过分区 mnesia_monitor:detect_inconcistency(GoodNodes, Context); _ -> %% If master_nodes is set ignore old inconsistencies ignore end, gen_server:reply(From, {GoodNodes, AlreadyConnected}), {noreply,State} end;handle_cast({announce_all, Nodes}, State) -> announce_all(Nodes), {noreply, State};announce_all([]) -> ok;announce_all(ToNodes) -> Tid = trans_tid_serial(), announce(ToNodes, [{trans_tid,serial,Tid}], [], false).announce(ToNodes, [Head | Tail], Acc, ForceSend) -> Acc2 = arrange(ToNodes, Head, Acc, ForceSend), announce(ToNodes, Tail, Acc2, ForceSend);announce(_ToNodes, [], Acc, _ForceSend) -> send_decisions(Acc).send_decisions([{Node, Decisions} | Tail]) -> %%注意此处,decision 合并过程是一个异步过程 abcast([Node], {decisions, node(), Decisions}), send_decisions(Tail);send_decisions([]) ->
  • 48. ok.遍历所有协商通过的节点,告知其本节点的历史事务 decision下列流程位于远程节点中,远程节点将被称为接收节点,而本节点将称为发送节点handle_cast({decisions, Node, Decisions}, State) -> mnesia_lib:add(recover_nodes, Node), State2 = add_remote_decisions(Node, Decisions, State), {noreply, State2};接收节点的 mnesia_monitor 在收到这些广播来的 decision 后,进行比较合并。decision 有多种类型,用于事务提交的为 decision 结构和 transient_decision 结构add_remote_decisions(Node, [D | Tail], State) when is_record(D, decision) -> State2 = add_remote_decision(Node, D, State), add_remote_decisions(Node, Tail, State2);add_remote_decisions(Node, [C | Tail], State) when is_record(C, transient_decision) -> D = #decision{tid = C#transient_decision.tid, outcome = C#transient_decision.outcome, disc_nodes = [], ram_nodes = []}, State2 = add_remote_decision(Node, D, State), add_remote_decisions(Node, Tail, State2);add_remote_decisions(Node, [{mnesia_down, _, _, _} | Tail], State) -> add_remote_decisions(Node, Tail, State);add_remote_decisions(Node, [{trans_tid, serial, Serial} | Tail], State) -> %%对于发送节点传来的未决事务,接收节点需要继续询问其它节点 sync_trans_tid_serial(Serial), case State#state.unclear_decision of undefined -> ignored; D -> case lists:member(Node, D#decision.ram_nodes) of true -> ignore; false -> %%若未决事务 decision 的发送节点不是内存副本节点,则接收节点将向其询问该未决事务的真正结果 abcast([Node], {what_decision, node(), D}) end
  • 49. end, add_remote_decisions(Node, Tail, State);add_remote_decisions(_Node, [], State) -> State.add_remote_decision(Node, NewD, State) -> Tid = NewD#decision.tid, OldD = decision(Tid), %%根据合并策略进行 decision 合并,对于唯一的冲突情况,即接收节点提交事务,而发送节点中止事务,则接收节点处也选择中止事务,而事务本身的状态将由检查点和 redo日志进行重构 D = merge_decisions(Node, OldD, NewD), %%记录合并结果 do_log_decision(D, false, undefined), Outcome = D#decision.outcome, if OldD == no_decision -> ignore; Outcome == unclear -> ignore; true -> case lists:member(node(), NewD#decision.disc_nodes) or lists:member(node(), NewD#decision.ram_nodes) of true -> %%向其它节点告知本节点的 decision 合并结果 tell_im_certain([Node], D); false -> ignore end end, case State#state.unclear_decision of U when U#decision.tid == Tid -> WaitFor = State#state.unclear_waitfor -- [Node], if Outcome == unclear, WaitFor == [] -> %% Everybody are uncertain, lets abort %%询问过未决事务的所有参与节点后,仍然没有任何节点可以提供事务提交结果,此时决定终止事务 NewOutcome = aborted, CertainD = D#decision{outcome = NewOutcome,
  • 50. disc_nodes = [], ram_nodes = []}, tell_im_certain(D#decision.disc_nodes, CertainD), tell_im_certain(D#decision.ram_nodes, CertainD), do_log_decision(CertainD, false, undefined), verbose("Decided to abort transaction ~p " "since everybody are uncertain ~p~n", [Tid, CertainD]), gen_server:reply(State#state.unclear_pid, {ok, NewOutcome}), State#state{unclear_pid = undefined, unclear_decision = undefined, unclear_waitfor = undefined}; Outcome /= unclear -> %%发送节点知道事务结果,通告事务结果 verbose("~p told us that transaction ~p was ~p~n", [Node, Tid, Outcome]), gen_server:reply(State#state.unclear_pid, {ok, Outcome}), State#state{unclear_pid = undefined, unclear_decision = undefined, unclear_waitfor = undefined}; Outcome == unclear -> %%发送节点也不知道事务结果,此时继续等待 State#state{unclear_waitfor = WaitFor} end; _ -> State end.合并策略:merge_decisions(Node, D, NewD0) -> NewD = filter_aborted(NewD0), if D == no_decision, node() /= Node -> %% We did not know anything about this txn NewD#decision{disc_nodes = []}; D == no_decision -> NewD; is_record(D, decision) -> DiscNs = D#decision.disc_nodes -- ([node(), Node]), OldD = filter_aborted(D#decision{disc_nodes = DiscNs}), if
  • 51. OldD#decision.outcome == unclear, NewD#decision.outcome == unclear -> D; OldD#decision.outcome == NewD#decision.outcome -> %% We have come to the same decision OldD; OldD#decision.outcome == committed, NewD#decision.outcome == aborted -> %%decision 发送节点与接收节点唯一冲突的位置,即接收节点提交事务,而发送节点中止事务,此时仍然选择中止事务 Msg = {inconsistent_database, bad_decision, Node}, mnesia_lib:report_system_event(Msg), OldD#decision{outcome = aborted}; OldD#decision.outcome == aborted -> OldD#decision{outcome = aborted}; NewD#decision.outcome == aborted -> OldD#decision{outcome = aborted}; OldD#decision.outcome == committed, NewD#decision.outcome == unclear -> OldD#decision{outcome = committed}; OldD#decision.outcome == unclear, NewD#decision.outcome == committed -> OldD#decision{outcome = committed} end end.2. 节点发现,集群遍历mnesia_controller.erlmerge_schema() -> AllNodes = mnesia_lib:all_nodes(), %%尝试合并 schema,合并完了后通知所有曾经的集群节点,与本节点进行数据转移 case try_merge_schema(AllNodes, [node()], fun default_merge/1) of ok -> %%合并 schema 成功后,将进行数据合并 schema_is_merged(); {aborted, {throw, Str}} when is_list(Str) -> fatal("Failed to merge schema: ~s~n", [Str]); Else -> fatal("Failed to merge schema: ~p~n", [Else]) end.
  • 52. try_merge_schema(Nodes, Told0, UserFun) -> %%开始集群遍历,启动一个 schema 合并事务 case mnesia_schema:merge_schema(UserFun) of {atomic, not_merged} -> %% No more nodes that we need to merge the schema with %% Ensure we have told everybody that we are running case val({current,db_nodes}) -- mnesia_lib:uniq(Told0) of [] -> ok; Tell -> im_running(Tell, [node()]), ok end; {atomic, {merged, OldFriends, NewFriends}} -> %% Check if new nodes has been added to the schema Diff = mnesia_lib:all_nodes() -- [node() | Nodes], mnesia_recover:connect_nodes(Diff), %% Tell everybody to adopt orphan tables %%通知所有的集群节点,本节点启动,开始数据合并申请 im_running(OldFriends, NewFriends), im_running(NewFriends, OldFriends), Told = case lists:member(node(), NewFriends) of true -> Told0 ++ OldFriends; false -> Told0 ++ NewFriends end, try_merge_schema(Nodes, Told, UserFun); {atomic, {"Cannot get cstructs", Node, Reason}} -> dbg_out("Cannot get cstructs, Node ~p ~p~n", [Node, Reason]), timer:sleep(300), % Avoid a endless loop look alike try_merge_schema(Nodes, Told0, UserFun); {aborted, {shutdown, _}} -> %% One of the nodes is going down timer:sleep(300), % Avoid a endless loop look alike try_merge_schema(Nodes, Told0, UserFun); Other -> Other end.mnesia_schema.erlmerge_schema() -> schema_transaction(fun() -> do_merge_schema([]) end).merge_schema(UserFun) -> schema_transaction(fun() -> UserFun(fun(Arg) -> do_merge_schema(Arg) end) end).可以看出 merge_schema 的过程也是放在一个 mnesia 元数据事务中进行的,这个事务的主
  • 53. 题操作包括:{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}{op, merge_schema, CstructList}这个过程会与集群中的事务节点进行 schema 协商,检查 schema 是否兼容。do_merge_schema(LockTabs0) -> %% 锁 schema 表 {_Mod, Tid, Ts} = get_tid_ts_and_lock(schema, write), LockTabs = [{T, tab_to_nodes(T)} || T <- LockTabs0], [get_tid_ts_and_lock(T,write) || {T,_} <- LockTabs], Connected = val(recover_nodes), Running = val({current, db_nodes}), Store = Ts#tidstore.store, %% Verify that all nodes are locked that might not be the %% case, if this trans where queued when new nodes where added. case Running -- ets:lookup_element(Store, nodes, 2) of [] -> ok; %% All known nodes are locked Miss -> %% Abort! We dont want the sideeffects below to be executed mnesia:abort({bad_commit, {missing_lock, Miss}}) end, %% Connected 是本节点的已连接节点,通常为当前集群中通信协议兼容的结点; Running是本节点的当前 db_nodes,通常为当前集群中与本节点一致的结点; case Connected -- Running of %% 对于那些已连接,但是还未进行 decision 的节点,需要进行通信协议协商,然后进行 decision 协商,这个过程实质上是一个全局拓扑下的节点发现过程(遍历算法) ,这个过程由某个节点发起, [Node | _] = OtherNodes -> %% Time for a schema merging party! mnesia_locker:wlock_no_exist(Tid, Store, schema, [Node]), [mnesia_locker:wlock_no_exist( Tid, Store, T, mnesia_lib:intersect(Ns, OtherNodes)) || {T,Ns} <- LockTabs], %% 从远程结点 Node 处取得其拥有的表的 cstruct,及其 db_nodes RemoteRunning1 case fetch_cstructs(Node) of {cstructs, Cstructs, RemoteRunning1} ->
• 54. LockedAlready = Running ++ [Node], %% 取得 cstruct 后,通过 mnesia_recover:connect_nodes,与远程节点 Node 的集群中的每一个节点进行协商,协商主要包括检查双方的通信协议版本,并检查之前与这些结点是否曾有过分区 {New, Old} = mnesia_recover:connect_nodes(RemoteRunning1), %% New 为 RemoteRunning1 中版本兼容的新结点,Old 为本节点原先的集群存活结点,来自于 recover_nodes RemoteRunning = mnesia_lib:intersect(New ++ Old, RemoteRunning1), if %% RemoteRunning = (New∪Old)∩RemoteRunning1 %% RemoteRunning ≠ RemoteRunning1 <=> %% New∪(Old∩RemoteRunning1) < RemoteRunning1 %%意味着 RemoteRunning1(远程节点 Node 的集群,也即此次探查的目标集群)中有部分节点不能与本节点相连 RemoteRunning /= RemoteRunning1 -> mnesia_lib:error("Mnesia on ~p could not connect to node(s) ~p~n", [node(), RemoteRunning1 -- RemoteRunning]), mnesia:abort({node_not_running, RemoteRunning1 -- RemoteRunning}); true -> ok end, NeedsLock = RemoteRunning -- LockedAlready, mnesia_locker:wlock_no_exist(Tid, Store, schema, NeedsLock), [mnesia_locker:wlock_no_exist(Tid, Store, T,mnesia_lib:intersect(Ns,NeedsLock)) || {T,Ns} <- LockTabs], NeedsConversion = need_old_cstructs(NeedsLock ++ LockedAlready), {value, SchemaCs} = lists:keysearch(schema, #cstruct.name, Cstructs), SchemaDef = cs2list(NeedsConversion, SchemaCs), %% Announce that Node is running %%开始 announce_im_running 的过程,向集群的事务节点通告本节点进入集群,同时告知本节点,集群事务节点在这个事务中会与本节点进行 schema 合并 A = [{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}],
  • 55. do_insert_schema_ops(Store, A), %% Introduce remote tables to local node %%make_merge_schema 构造一系列合并 schema 的 merge_schema 操作,在提交成功后由 mnesia_dumper 执行生效 do_insert_schema_ops(Store, make_merge_schema(Node, NeedsConversion,Cstructs)), %% Introduce local tables to remote nodes Tabs = val({schema, tables}), Ops = [{op, merge_schema, get_create_list(T)} || T <- Tabs, not lists:keymember(T, #cstruct.name, Cstructs)], do_insert_schema_ops(Store, Ops), %%Ensure that the txn will be committed on all nodes %%向另一个可连接集群中的所有节点通告本节点正在加入集群 NewNodes = RemoteRunning -- Running, mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}), announce_im_running(NewNodes, SchemaCs), {merged, Running, RemoteRunning}; {error, Reason} -> {"Cannot get cstructs", Node, Reason}; {badrpc, Reason} -> {"Cannot get cstructs", Node, {badrpc, Reason}} end; [] -> %% No more nodes to merge schema with not_merged end.announce_im_running([N | Ns], SchemaCs) -> %%与新的可连接集群的节点经过协商 {L1, L2} = mnesia_recover:connect_nodes([N]), case lists:member(N, L1) or lists:member(N, L2) of true -> %%若协商通过,则这些节点就可以作为本节点的事务节点了,注意此处,这个修改是立即生效的,而不会延迟到事务提交 mnesia_lib:add({current, db_nodes}, N), mnesia_controller:add_active_replica(schema, N, SchemaCs);
  • 56. false -> %%若协商未通过,则中止事务,此时会通过 announce_im_running 的 undo 动作,将新加入的事务节点全部剥离 mnesia_lib:error("Mnesia on ~p could not connect to node ~p~n", [node(), N]), mnesia:abort({node_not_running, N}) end, announce_im_running(Ns, SchemaCs);announce_im_running([], _) -> [].schema 操作在三阶段提交时,mnesia_tm 首先要进行 prepare:mnesia_tm.erlmulti_commit(asym_trans, Majority, Tid, CR, Store) ->… SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})),…mnesia_schema.erlprepare_commit(Tid, Commit, WaitFor) -> case Commit#commit.schema_ops of [] -> {false, Commit, optional}; OrigOps -> {Modified, Ops, DumperMode} = prepare_ops(Tid, OrigOps, WaitFor, false, [], optional), … end.prepare_ops(Tid, [Op | Ops], WaitFor, Changed, Acc, DumperMode) -> case prepare_op(Tid, Op, WaitFor) of … {false, optional} -> prepare_ops(Tid, Ops, WaitFor, true, Acc, DumperMode) end;prepare_ops(_Tid, [], _WaitFor, Changed, Acc, DumperMode) -> {Changed, Acc, DumperMode}.prepare_op(_Tid, {op, announce_im_running, Node, SchemaDef, Running, RemoteRunning},_WaitFor) -> SchemaCs = list2cs(SchemaDef), if Node == node() -> %% Announce has already run on local node
  • 57. ignore; %% from do_merge_schema true -> %% If a node has restarted it may still linger in db_nodes, %% but have been removed from recover_nodes Current = mnesia_lib:intersect(val({current,db_nodes}), [node()|val(recover_nodes)]), NewNodes = mnesia_lib:uniq(Running++RemoteRunning) -- Current, mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}), announce_im_running(NewNodes, SchemaCs) end, {false, optional};此处可以看出,在 announce_im_running 的 prepare 过程中,要与远程未连接的节点进行协商,协商通过后,这些未连接节点将加入本节点的事务节点集群反之,一旦该 schema 操作中止,mnesia_tm 将进行 undo 动作:mnesia_tm.erlcommit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) -> ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]), case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of {Modified, C = #commit{}, DumperMode} -> %% If we can not find any local unclear decision %% we should presume abort at startup recovery case lists:member(node(), DiscNs) of false -> ignore; true -> case Modified of false -> mnesia_log:log(Bin); true -> mnesia_log:log(C) end end, ?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]), reply(Coord, {vote_yes, Tid, self()}), receive {Tid, pre_commit} -> … receive {Tid, committed} -> … {Tid, {do_abort, _Reason}} ->
  • 58. … mnesia_schema:undo_prepare_commit(Tid, C0), … {EXIT, _, _} -> … mnesia_schema:undo_prepare_commit(Tid, C0), … end; {Tid, {do_abort, Reason}} -> … mnesia_schema:undo_prepare_commit(Tid, C0), … {EXIT, _, Reason} -> … mnesia_schema:undo_prepare_commit(Tid, C0), … end; {EXIT, Reason} -> … mnesia_schema:undo_prepare_commit(Tid, C0) end, ….mnesia_schema.erlundo_prepare_commit(Tid, Commit) -> case Commit#commit.schema_ops of [] -> ignore; Ops -> %% Catch to allow failure mnesia_controller may not be started catch mnesia_controller:release_schema_commit_lock(), undo_prepare_ops(Tid, Ops) end, Commit.undo_prepare_ops(Tid, [Op | Ops]) -> case element(1, Op) of TheOp when TheOp /= op, TheOp /= restore_op -> undo_prepare_ops(Tid, Ops); _ -> undo_prepare_ops(Tid, Ops), undo_prepare_op(Tid, Op) end;undo_prepare_ops(_Tid, []) -> [].undo_prepare_op(_Tid, {op, announce_im_running, _Node, _, _Running, _RemoteRunning}) ->
  • 59. case ?catch_val(prepare_op) of {announce_im_running, New} -> unannounce_im_running(New); _Else -> ok end;unannounce_im_running([N | Ns]) -> mnesia_lib:del({current, db_nodes}, N), mnesia_controller:del_active_replica(schema, N), unannounce_im_running(Ns);unannounce_im_running([]) -> ok.由此可见集群发现与合并事务节点加入:mnesia_controller.erladd_active_replica(Tab, Node, Storage, AccessMode) -> Var = {Tab, where_to_commit}, {Blocked, Old} = is_tab_blocked(val(Var)), Del = lists:keydelete(Node, 1, Old), case AccessMode of read_write -> New = lists:sort([{Node, Storage} | Del]), set(Var, mark_blocked_tab(Blocked, New)), % where_to_commit mnesia_lib:add_lsort({Tab, where_to_write}, Node); read_only -> set(Var, mark_blocked_tab(Blocked, Del)), mnesia_lib:del({Tab, where_to_write}, Node) end, update_where_to_wlock(Tab), add({Tab, active_replicas}, Node).事务节点删除:mnesia_controller.erldel_active_replica(Tab, Node) -> Var = {Tab, where_to_commit}, {Blocked, Old} = is_tab_blocked(val(Var)), Del = lists:keydelete(Node, 1, Old), New = lists:sort(Del), set(Var, mark_blocked_tab(Blocked, New)), % where_to_commit mnesia_lib:del({Tab, active_replicas}, Node),
  • 60. mnesia_lib:del({Tab, where_to_write}, Node), update_where_to_wlock(Tab).3. 节点 schema 合并schema 操作构造,该过程较长,此处仅总结:schema 表合并:1. 本节点与远程节点的 schema 表的 cookie 不同, 且二者有不同的 master 或都没有 master, 此时不能合并;2. 本节点与远程节点的 schema 表的存储类型不同,且二者都是 disc_copies,此时不能合 并;普通表合并:1. 本节点与远程节点的普通表的 cookie 不同,且二者有不同的 master 或都没有 master, 此时不能合并;在可以合并的情况下,需要合并表的 cstruct,storage_type,version:1. 合并 storage_type 时,disc_copies 与 ram_copies 优先选择 disc_copies,disc_only_copies 与 disc_copies 优先选择 disc_only_copies,而 ram_copies 与 disc_only_copies 不兼容;2. 合并 version 时,要求表的主要定义属性必须相同,同时选择主、次版本号较大的一方;由此可见,schema 在大多数情况下还是很容易合并成功的。schema 真正的写入过程在提交阶段 do_commit 中进行:mnesia_dumper.erlinsert_op(Tid, _, {op, merge_schema, TabDef}, InPlace, InitBy) -> Cs = mnesia_schema:list2cs(TabDef), case Cs#cstruct.name of schema -> Update = fun(NS = {Node,Storage}) ->
  • 61. case mnesia_lib:cs_to_storage_type(Node, Cs) of Storage -> NS; disc_copies when Node == node() -> Dir = mnesia_lib:dir(), ok = mnesia_schema:opt_create_dir(true, Dir), mnesia_schema:purge_dir(Dir, []), mnesia_log:purge_all_logs(), mnesia_lib:set(use_dir, true), mnesia_log:init(), Ns = val({current, db_nodes}), F = fun(U) -> mnesia_recover:log_mnesia_up(U) end, lists:foreach(F, Ns), raw_named_dump_table(schema, dat), temp_set_master_nodes(), {Node,disc_copies}; CSstorage -> {Node,CSstorage} end end, W2C0 = val({schema, where_to_commit}), W2C = case W2C0 of {blocked, List} -> {blocked,lists:map(Update,List)}; List -> lists:map(Update,List) end, if W2C == W2C0 -> ignore; true -> mnesia_lib:set({schema, where_to_commit}, W2C) end; _ -> ignore end, insert_cstruct(Tid, Cs, false, InPlace, InitBy);insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) -> Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts), {schema, Tab, _} = Val, S = val({schema, storage_type}), disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy), Tab.分别在 ets 表 mnesia_gvar、内存 schema 表、磁盘 schema 表中记录新的表 cstruct 及其相关信息。
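对于上面总结的 storage_type 合并规则,可以用一个假设性的辅助函数直观表达(仅为示意草图,merge_storage/2 并非 mnesia 源码中的函数,返回值形式也只是示例):

%% 按上文总结的优先级合并同一张表在两个节点上的存储类型
merge_storage(S, S)                          -> {ok, S};
merge_storage(disc_copies, ram_copies)       -> {ok, disc_copies};        %% disc_copies 优先于 ram_copies
merge_storage(ram_copies, disc_copies)       -> {ok, disc_copies};
merge_storage(disc_only_copies, disc_copies) -> {ok, disc_only_copies};   %% disc_only_copies 优先于 disc_copies
merge_storage(disc_copies, disc_only_copies) -> {ok, disc_only_copies};
merge_storage(ram_copies, disc_only_copies)  -> {error, incompatible};    %% ram_copies 与 disc_only_copies 不兼容
merge_storage(disc_only_copies, ram_copies)  -> {error, incompatible}.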
  • 62. 4. 节点数据合并部分 1,从远程节点加载表mnesia_controller 在 merge_schema->try_merge_schema 中调用 im_running 两次,一次通知存活节点,告知本节点及本节点发现的新节点,另一次通知新节点(新节点包括自身),告知本节点及本节点集群内的其它节点,这样便联通了旧有集群和新集群。存活节点收到 im_running 消息后,立即通知新节点(包括发出 im_running 请求的节点)一个 adopt_orphans 消息,令其接收自己的表。由于本节点广播 im_running 消息,也会有多个adopt_orphans 消息递送到本节点,本节点将从第一个到达的 adopt_orphans 消息的来源节点处取得表数据。通常情况下,若一次仅有本节点启动,则这个动作将被简化为,本节点通知向所有的存活节点通告 im_running,然后一个最先返回的存活节点将向本节点传递表数据。mnesia_controller.erlim_running(OldFriends, NewFriends) -> abcast(OldFriends, {im_running, node(), NewFriends}).handle_cast({im_running, Node, NewFriends}, State) -> LocalTabs = mnesia_lib:local_active_tables() -- [schema], RemoveLocalOnly = fun(Tab) -> not val({Tab, local_content}) end, Tabs = lists:filter(RemoveLocalOnly, LocalTabs), Nodes = mnesia_lib:union([Node],val({current, db_nodes})), Ns = mnesia_lib:intersect(NewFriends, Nodes), %%通知远程节点,与本节点进行数据交互 abcast(Ns, {adopt_orphans, node(), Tabs}), noreply(State);本节点通知远程节点,与本节点进行数据交互。注意此时 decision 合并过程也同时在进行中。远程节点向本节点返回 adopt_orphans 消息。
  • 63. 本节点收到一个远程节点发送的 adopt_orphans 消息后,将开始从这个节点处取得数据:handle_cast({adopt_orphans, Node, Tabs}, State) -> %%node_has_tabs 将抢先一步从当前存活的节点处拉取数据,若这一步能取得表数据,则之后不会再从本地磁盘中取数据,为了保持数据一致性,最好选择设置 master 节点,全局数据与 master 保持一致。 %%本节点将远程节点加入表的活动副本中,并开始异步取得数据 State2 = node_has_tabs(Tabs, Node, State), case ?catch_val({node_up,Node}) of true -> ignore; _ -> %% Register the other node as up and running %%标识远程节点 up,并产生 mnesia_up 事件 set({node_up, Node}, true), mnesia_recover:log_mnesia_up(Node), verbose("Logging mnesia_up ~w~n",[Node]), mnesia_lib:report_system_event({mnesia_up, Node}), %% Load orphan tables LocalTabs = val({schema, local_tables}) -- [schema], Nodes = val({current, db_nodes}), %%若未设置 master,则 RemoteMasters 为[],若无 local 表,则 LocalOrphans 为[]。 %%若有 local 表,则从磁盘加载这些 local 表 {LocalOrphans, RemoteMasters} = orphan_tables(LocalTabs, Node, Nodes, [], []), Reason = {adopt_orphan, node()}, mnesia_late_loader:async_late_disc_load(node(), LocalOrphans, Reason), Fun = fun(N) -> RemoteOrphans = [Tab || {Tab, Ns} <- RemoteMasters, lists:member(N, Ns)], mnesia_late_loader:maybe_async_late_disc_load(N, RemoteOrphans, Reason) end, lists:foreach(Fun, Nodes) end, noreply(State2);
  • 64. node_has_tabs([Tab | Tabs], Node, State) when Node /= node() -> State2 = case catch update_whereabouts(Tab, Node, State) of State1 = #state{} -> State1; {EXIT, R} -> %% Tab was just deleted? case ?catch_val({Tab, cstruct}) of {EXIT, _} -> State; % yes _ -> erlang:error(R) end end, node_has_tabs(Tabs, Node, State2);update_whereabouts(Tab, Node, State) -> Storage = val({Tab, storage_type}), Read = val({Tab, where_to_read}), LocalC = val({Tab, local_content}), BeingCreated = (?catch_val({Tab, create_table}) == true), Masters = mnesia_recover:get_master_nodes(Tab), ByForce = val({Tab, load_by_force}), GoGetIt = if ByForce == true -> true; Masters == [] -> true; true -> lists:member(Node, Masters) end, if … %%启动时,有多个副本的表,其 where_to_read 首先设置为 nowhere,此处触发表的远程节点加载过程: Read == nowhere -> add_active_replica(Tab, Node), case GoGetIt of true -> %%产生一个#net_load{}任务,通过 opt_start_loader->load_and_reply 启动一个load_table_fun 来处理这个任务。 Worker = #net_load{table = Tab, reason = {active_remote, Node}}, add_worker(Worker, State); false -> State
  • 65. end; … end.load_table_fun(#net_load{cstruct=Cs, table=Tab, reason=Reason, opt_reply_to=ReplyTo}) -> LocalC = val({Tab, local_content}), AccessMode = val({Tab, access_mode}), ReadNode = val({Tab, where_to_read}), Active = filter_active(Tab), Done = #loader_done{is_loaded = true, table_name = Tab, needs_announce = false, needs_sync = false, needs_reply = (ReplyTo /= undefined), reply_to = ReplyTo, reply = {loaded, ok} }, if ReadNode == node() -> %% Already loaded locally fun() -> Done end; LocalC == true -> fun() -> Res = mnesia_loader:disc_load_table(Tab, load_local_content), Done#loader_done{reply = Res, needs_announce = true, needs_sync = true} end; AccessMode == read_only, Reason /= {dumper,add_table_copy} -> fun() -> disc_load_table(Tab, Reason, ReplyTo) end; true -> fun() -> %% Either we cannot read the table yet %% or someone is moving a replica between %% two nodes %%加载过程为创建表对应的 ets 表,然后从远程节点的发送进程逐条读取记录到本节点的接收进程,有本地接收进程将记录重新插入 ets 表。 Res = mnesia_loader:net_load_table(Tab, Reason, Active, Cs), case Res of {loaded, ok} -> Done#loader_done{needs_sync = true, reply = Res}; {not_loaded, _} -> Done#loader_done{is_loaded = false,
  • 66. reply = Res} end end end;5. 节点数据合并部分 2,从本地磁盘加载表由于 try_merge_schema 将首先从存活的节点处取得数据,本地节点将优先保持与远程节点的数据一致,相比于本地节点,mnesia 倾向于选择存活的节点的数据副本,因为他们可能保持着最新的内容。但若没有任何远程存活节点(如集群整体关闭,本节点为第一个启动节点),则此时考虑从本地磁盘加载数据。主要场景:1. 若节点不是一个已关闭集群中第一个启动的节点,若为 master,则加载所有表,若非 master,则仅能加载 local 表。2. 若节点是一个已关闭集群中第一个启动的节点,若: a) 这个节点是最后一个关闭的(由节点 decision 表中的 mnesia_down 历史记录确定), 则节点将从本地磁盘加载表; b) 这个节点不是最后一个关闭的,则节点不从本地磁盘加载表,同时等待其他远程节 点启动,并通知本节点 adopt_orphans 消息,才从远程节点处加载表,远程节点未 返回前,表在本节点不可见(表定义可以从 mnesia:schema/0 中获取,但表对应的 ets 表还没有创建,因此表不可见)schema_is_merged() -> MsgTag = schema_is_merged, %%根据 mnesia_down 的历史记录,确认本节点是否为集群最后一个关闭的节点或表的master 节点,若是,则确定将表从本地加载:
  • 67. SafeLoads = initial_safe_loads(), try_schedule_late_disc_load(SafeLoads, initial, MsgTag).initial_safe_loads() -> case val({schema, storage_type}) of ram_copies -> Downs = [], Tabs = val({schema, local_tables}) -- [schema], LastC = fun(T) -> last_consistent_replica(T, Downs) end, lists:zf(LastC, Tabs); disc_copies -> Downs = mnesia_recover:get_mnesia_downs(), dbg_out("mnesia_downs = ~p~n", [Downs]), Tabs = val({schema, local_tables}) -- [schema], LastC = fun(T) -> last_consistent_replica(T, Downs) end, lists:zf(LastC, Tabs) end.last_consistent_replica(Tab, Downs) -> Cs = val({Tab, cstruct}), Storage = mnesia_lib:cs_to_storage_type(node(), Cs), Ram = Cs#cstruct.ram_copies, Disc = Cs#cstruct.disc_copies, DiscOnly = Cs#cstruct.disc_only_copies, BetterCopies0 = mnesia_lib:remote_copy_holders(Cs) -- Downs, BetterCopies = BetterCopies0 -- Ram, AccessMode = Cs#cstruct.access_mode, Copies = mnesia_lib:copy_holders(Cs), Masters = mnesia_recover:get_master_nodes(Tab), LocalMaster0 = lists:member(node(), Masters), LocalContent = Cs#cstruct.local_content, RemoteMaster = if Masters == [] -> false; true -> not LocalMaster0 end, LocalMaster = if Masters == [] -> false; true -> LocalMaster0 end, if Copies == [node()] ->
  • 68. %% Only one copy holder and it is local. %% It may also be a local contents table {true, {Tab, local_only}}; LocalContent == true -> {true, {Tab, local_content}}; LocalMaster == true -> %% We have a local master {true, {Tab, local_master}}; RemoteMaster == true -> %% Wait for remote master copy false; Storage == ram_copies -> if Disc == [], DiscOnly == [] -> %% Nobody has copy on disc {true, {Tab, ram_only}}; true -> %% Some other node has copy on disc false end; AccessMode == read_only -> %% No one has been able to update the table, %% i.e. all disc resident copies are equal {true, {Tab, read_only}}; BetterCopies /= [], Masters /= [node()] -> %% There are better copies on other nodes %% and we do not have the only master copy false; true -> {true, {Tab, initial}} end.try_schedule_late_disc_load(Tabs, _Reason, MsgTag) when Tabs == [], MsgTag /= schema_is_merged -> ignore;try_schedule_late_disc_load(Tabs, Reason, MsgTag) -> %%通过一个 mnesia 事务来进行表加载过程 GetIntents = fun() -> %%上一个全局磁盘表加载锁 mnesia_late_disc_load Item = mnesia_late_disc_load, Nodes = val({current, db_nodes}),
  • 69. mnesia:lock({global, Item, Nodes}, write), %%询问其它远程节点,它们是否正在加载或已经加载了这些表,若正在加载或已经加载,则本节点不会从磁盘加载,而是等待远程节点产生 adopt_orphans 消息,告知本节点远程加载表。 case multicall(Nodes -- [node()], disc_load_intents) of {Replies, []} -> %%等待表加载完成: %%MsgTag = schema_is_merged call({MsgTag, Tabs, Reason, Replies}), done; {_, BadNodes} -> %% Some nodes did not respond, lets try again {retry, BadNodes} end end, case mnesia:transaction(GetIntents) of {atomic, done} -> done; {atomic, {retry, BadNodes}} -> verbose("Retry late_load_tables because bad nodes: ~p~n", [BadNodes]), try_schedule_late_disc_load(Tabs, Reason, MsgTag); {aborted, AbortReason} -> fatal("Cannot late_load_tables~p: ~p~n", [[Tabs, Reason, MsgTag], AbortReason]) end.handle_call({schema_is_merged, TabsR, Reason, RemoteLoaders}, From, State) -> %% 产 生 一 个 #disc_load{} 任 务 , 通 过 opt_start_loader->load_and_reply 启 动 一 个load_table_fun 来处理这个任务。 State2 = late_disc_load(TabsR, Reason, RemoteLoaders, From, State), Msgs = State2#state.early_msgs, State3 = State2#state{early_msgs = [], schema_is_merged = true}, handle_early_msgs(lists:reverse(Msgs), State3);load_table_fun(#disc_load{table=Tab, reason=Reason, opt_reply_to=ReplyTo}) -> ReadNode = val({Tab, where_to_read}), Active = filter_active(Tab),
  • 70. Done = #loader_done{is_loaded = true, table_name = Tab, needs_announce = false, needs_sync = false, needs_reply = false }, if Active == [], ReadNode == nowhere -> %% Not loaded anywhere, lets load it from disc fun() -> disc_load_table(Tab, Reason, ReplyTo) end; ReadNode == nowhere -> %% Already loaded on other node, lets get it Cs = val({Tab, cstruct}), fun() -> case mnesia_loader:net_load_table(Tab, Reason, Active, Cs) of {loaded, ok} -> Done#loader_done{needs_sync = true}; {not_loaded, storage_unknown} -> Done#loader_done{is_loaded = false}; {not_loaded, ErrReason} -> Done#loader_done{is_loaded = false, reply = {not_loaded,ErrReason}} end end; true -> %% Already readable, do not worry be happy fun() -> Done end end.disc_load_table(Tab, Reason, ReplyTo) -> Done = #loader_done{is_loaded = true, table_name = Tab, needs_announce = false, needs_sync = false, needs_reply = ReplyTo /= undefined, reply_to = ReplyTo, reply = {loaded, ok} }, %%加载过程为从表的磁盘数据文件 Table.DCT 中取得数据(erlang 的 term),并从日志文件 Table.DCL(若有)中取得 redo 日志,合并到数据的 ets 表中。 Res = mnesia_loader:disc_load_table(Tab, Reason), if Res == {loaded, ok} ->
  • 71. Done#loader_done{needs_announce = true, needs_sync = true, reply = Res}; ReplyTo /= undefined -> Done#loader_done{is_loaded = false, reply = Res}; true -> fatal("Cannot load table ~p from disc: ~p~n", [Tab, Res]) end.最终通过 mnesia_loader:disc_load_table 从磁盘加载表。由此可见,不管是从网络还是从磁盘加载表,最终都是通过 load_table_fun->mnesia_loader进行表加载,只是加载函数有区别,一个为 net_load_table,另一个为 disc_load_table。这两个函数的主要工作过程相似,都为建立表对应的 ets 表,然后从远程节点/磁盘取回记录插入 ets 表。6. 节点数据合并部分 2,表加载完成表加载完成后,load_table_fun 向 mnesia_controller 返回一个#loader_done{}:handle_info(Done = #loader_done{worker_pid=WPid, table_name=Tab}, State0) -> LateQueue0 = State0#state.late_loader_queue, State1 = State0#state{loader_pid = lists:keydelete(WPid,1,get_loaders(State0))}, State2 = case Done#loader_done.is_loaded of true -> %% Optional table announcement if Done#loader_done.needs_announce == true, Done#loader_done.needs_reply == true -> i_have_tab(Tab), %% Should be {dumper,add_table_copy} only reply(Done#loader_done.reply_to, Done#loader_done.reply); Done#loader_done.needs_reply == true -> %% Should be {dumper,add_table_copy} only
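结合上面两小节,应用侧排查表的加载来源、等待表可用时,常用 mnesia:table_info/2(load_node、load_reason、where_to_read 均为文档化条目)以及 mnesia:wait_for_tables/2、mnesia:force_load_table/1。下面是一个假设性的小示例(observe_tab/1、ensure_tables/2 为示意用的封装,并非 mnesia 接口;强制加载会放弃等待其它副本,可能读到旧数据,需谨慎使用):

%% 观察一张表是从哪个节点、因何被加载,以及当前读操作指向的副本
observe_tab(Tab) ->
    io:format("~p: load_node=~p load_reason=~p where_to_read=~p~n",
              [Tab,
               mnesia:table_info(Tab, load_node),
               mnesia:table_info(Tab, load_reason),
               mnesia:table_info(Tab, where_to_read)]).

%% 启动后等待表可用;若因“本节点不是最后关闭的节点”而超时,
%% 可选择强制从本地磁盘加载(牺牲一致性换取可用性)
ensure_tables(Tabs, Timeout) ->
    case mnesia:wait_for_tables(Tabs, Timeout) of
        ok ->
            ok;
        {timeout, Missing} ->
            [mnesia:force_load_table(T) || T <- Missing],
            mnesia:wait_for_tables(Missing, Timeout);
        {error, Reason} ->
            {error, Reason}
    end.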
  • 72. reply(Done#loader_done.reply_to, Done#loader_done.reply); Done#loader_done.needs_announce == true, Tab == schema -> i_have_tab(Tab); Done#loader_done.needs_announce == true -> i_have_tab(Tab), %% Local node needs to perform user_sync_tab/1 Ns = val({current, db_nodes}), abcast(Ns, {i_have_tab, Tab, node()}); Tab == schema -> ignore; true -> %% Local node needs to perform user_sync_tab/1 Ns = val({current, db_nodes}), AlreadyKnows = val({Tab, active_replicas}), %%表加载完成后,本节点会向其它节点发送一个 i_have_tab 消息,通知其他节点本节点持有最完整的表副本,其它节点进一步通过 node_has_tabs->update_whereabouts向本节点取得表数据,加入自己的副本中。 abcast(Ns -- AlreadyKnows, {i_have_tab, Tab, node()}) end, %% Optional user sync case Done#loader_done.needs_sync of true -> user_sync_tab(Tab); false -> ignore end, State1#state{late_loader_queue=gb_trees:delete_any(Tab, LateQueue0)}; false -> %% Either the node went down or table was not %% loaded remotly yet case Done#loader_done.needs_reply of true -> reply(Done#loader_done.reply_to, Done#loader_done.reply); false -> ignore end, case ?catch_val({Tab, active_replicas}) of [_|_] -> % still available elsewhere {value,{_,Worker}} = lists:keysearch(WPid,1,get_loaders(State0)), add_loader(Tab,Worker,State1); _ ->
  • 73. State1 end end, State3 = opt_start_worker(State2), noreply(State3);由此可见,表副本同步的过程包括 push 和 pull 两类方式:1. 节点启动时会主动从活动节点中 pull 数据过来,这一条线为:本节点-im_running->远程 节点-adopt_orphans->本节点,本节点通过 node_has_tabs 建立#net_load{}任务从远程节 点 pull 数据;2. 节点启动时加载表后,会主动向其它活动节点 push 本节点加载的表,这一条线为:本 节点合并 schema 完毕,决定从本地磁盘加载表,加载完成后,本节点-i_have_tab->远程 节点,远程节点通过 node_has_tabs 建立#net_load{}任务从本节点 pull 数据;7. 分区检测同步检测是指 mnesia 在加分布式锁或进行事务提交时,进行的分区检测,由于 mnesia 主动连接各个参与节点,因此这一步直接集成在了锁和事务协议中。对于 majority 类表来说,锁和事务协议交互中,若发现存活的可参与节点超过半数时,事务即可进行下去,否则不能。1. 锁过程中的同步检测对于锁协议:锁协议是一阶段同步协议。mnesia_locker.erl
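%% 补充示意:majority 判定的核心是“可达副本数严格超过全部副本数的一半”。
%% 下面的 has_majority/2 只是一个等价的示意写法,并非 mnesia 源码;
%% mnesia 实际由 mnesia_lib:have_majority/2 配合下文的 check_majority 完成该判断。
has_majority(AllReplicas, AliveReplicas) ->
    Alive = [N || N <- AllReplicas, lists:member(N, AliveReplicas)],
    2 * length(Alive) > length(AllReplicas).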
  • 74. wlock(Tid, Store, Oid) -> wlock(Tid, Store, Oid, _CheckMajority = true).wlock(Tid, Store, Oid, CheckMajority) -> {Tab, Key} = Oid, case need_lock(Store, Tab, Key, write) of yes -> {Ns, Majority} = w_nodes(Tab), if CheckMajority -> check_majority(Majority, Tab, Ns); true -> ignore end, Op = {self(), {write, Tid, Oid}}, ?ets_insert(Store, {{locks, Tab, Key}, write}), get_wlocks_on_nodes(Ns, Ns, Store, Op, Oid); no when Key /= ?ALL, Tab /= ?GLOBAL -> []; no -> element(2, w_nodes(Tab)) end.w_nodes(Tab) -> case ?catch_val({Tab, where_to_wlock}) of {[_ | _], _} = Where -> Where; _ -> mnesia:abort({no_exists, Tab}) end.check_majority(true, Tab, HaveNs) -> check_majority(Tab, HaveNs);check_majority(false, _, _) -> ok.check_majority(Tab, HaveNs) -> case ?catch_val({Tab, majority}) of true -> case mnesia_lib:have_majority(Tab, HaveNs) of true -> ok; false -> mnesia:abort({no_majority, Tab}) end; _ -> ok
  • 75. end.可以看出上写锁时,需要根据表的 where_to_wlock 属性,确定是否需要进行 majority 检查,where_to_wlock 属性是一个动态属性,当有节点加入退出时,该属性也随之更改。这个检查过程与 schema 表操作对 majority 的检查相同,均为超过半数时才同意。2. 事务过程中的同步检测对于事务协议:事务协议是二阶段同步一阶段异步协议,由于其提交过程已经在前面叙述过,这里仅列出其majority 检查的过程:multi_commit(asym_trans, Majority, Tid, CR, Store) -> D = #decision{tid = Tid, outcome = presume_abort}, {D2, CR2} = commit_decision(D, CR, [], []), DiscNs = D2#decision.disc_nodes, RamNs = D2#decision.ram_nodes, case have_majority(Majority, DiscNs ++ RamNs) of ok -> ok; {error, Tab} -> mnesia:abort({no_majority, Tab}) end, …可以看出提交事务时,需要根据表的 where_to_commit 属性,确定是否需要进行 majority检查,where_to_commit 属性是一个动态属性,当有节点加入退出时,该属性也随之更改。锁和事务阶段将 majority 检查作为前置条件不变式进行检查,可以快速的确定事务是否可以进行。对锁和事务协议进行不变式分析:1. 确定锁协议的参与节点为表的 where_to_wlock 属性;2. 若锁协议 majority 检查不通过,则记录无法上锁,事务退出;
• 76. 3. 若锁协议 majority 检查通过,此时锁协议的参与节点已经确定,若任何一个参与节点退出,则该退出能在锁协议的同步交互阶段被检测出来,从而导致上锁失败,事务退出;4. 若锁协议 majority 检查通过,请求节点在同步锁请求过程中退出,对于已上锁的参与节点,锁会因超时而被清除,对于未上锁的参与节点,没有任何影响;5. 若锁协议 majority 检查通过,所有参与节点同意上锁,之后的过程将由事务提交过程接管,若此后某个参与节点退出,也不会影响到事务协议;6. 确定事务协议的参与节点为表的 where_to_commit 属性,事务协议的参与节点的确立晚于锁协议参与节点,且与其无关,因此在阶段 5 之后有任何节点退出,均不会影响事务协议,事务协议单独决策;7. 若事务协议 majority 检查不通过,则记录无法提交,事务退出;8. 若事务协议 majority 检查通过,此时事务协议的参与节点已经确定,若任何一个参与节点在 prepare、precommit 阶段退出,则该退出能在事务协议的同步交互阶段被检测出来,从而导致提交失败,事务退出;9. 若事务协议 majority 检查通过,请求节点在 prepare、precommit 阶段退出,对于已 prepare、precommit 的参与节点,因没有进行实际的提交,不会有任何实际状态的改变,事务描述符会因超时而被清除,对于未 prepare、precommit 的参与节点,没有任何影响;10. 若事务协议 majority 检查通过,此时事务协议的参与节点已经确定,若任何一个参与节点在 commit 阶段退出,此时其它参与节点将进行提交,退出节点无法提交,但是能在恢复时,从已提交节点处获取提交数据,进行自身的提交;11. 若事务协议 majority 检查通过,此时事务协议的参与节点已经确定,若请求节点在 commit 阶段退出,mnesia_monitor 检查到请求节点退出,向 mnesia_tm 通告 mnesia_down 消息,mnesia_tm 在收到了 mnesia_down 消息后,会通过 mnesia_recover 询问其它节点
• 77. 是否有提交: a) 若 commit 还未进行,各个参与节点彼此询问后,得到的结果仍然是未决的,则一致认为事务结果为 abort,并向其它节点广播自身的 abort 结果,事务退出,不会出现不一致状态; b) 若 commit 已经进行了一部分,此时集群中仅存在二类参与节点:已提交的,未决的。未决的参与节点在询问到已提交的参与节点时,已提交的节点会返回 commit 的结果,未决节点也因此可以 commit;而若未决的参与节点始终未能询问到任何已提交的参与节点,则会一致确定事务结果为退出,并向其它节点广播自身的 abort 结果,而已提交的参与节点确定事务结果为提交,此时出现不一致,已提交节点产生{inconsistent_database, bad_decision, Node}消息,需要某种决策解决这个问题;但由于在第三阶段提交时,请求者已经几乎没有什么操作,仅仅是异步广播一条{Tid, committed}消息,因此,此处出现不一致的情况微乎其微; c) 若 commit 完成,各个参与节点均已提交,不会出现不一致状态。此外,inconsistent_database 消息还会出现在运行时:{inconsistent_database, running_partitioned_network, Node}和重新启动时:{inconsistent_database, starting_partitioned_network, Node}。由于这些策略的存在,上述不变式可以处理多个节点退出的情况。由于前两个阶段未提交,因此不会出现不一致的状态,而在第三阶段中:1. 仅请求节点退出,各个参与节点检测到请求节点的退出,mnesia_tm 开始询问其它节点的事务结果,从全局角度来看: a) 若没有参与节点提交,则所有参与节点都认为该事务 abort; b) 若一部分参与节点提交,则未决参与节点询问提交参与节点后,得到提交结果,最终事务提交;
• 78. c) 请求节点在重启时,将根据本地恢复日志确定提交的结果: i. 若没有第一阶段恢复日志 unclear,则事务被认为是中止; ii. 若仅有第一阶段恢复日志 unclear,需等待其他参与节点的结果: 1. 若其他参与节点提交,则根据其他参与节点的提交结果进行提交; 2. 若其他参与节点中止,则根据其他参与节点的中止结果进行中止; 3. 若所有参与节点均不知道结果,也即所有参与节点均仅有第一阶段恢复日志 unclear,此时认为事务中止(代码注释中写明提交,但经过试验却发现并非如此,因为 unclear 日志不会落盘); iii. 若已有第二阶段恢复日志 committed,则事务被认为是提交;2. 请求节点不退出,仅部分/全部参与节点退出,请求节点与各个存活参与节点检测到这些参与节点的退出,mnesia_tm 开始询问其它节点的事务结果,从全局角度来看: a) 第三阶段时,请求节点发出 commit 消息前后,会立即进行提交,其它未决参与节点至少可以根据请求节点的提交结果,决定提交事务; b) 退出的参与节点在重启时,将根据本地恢复日志和请求节点确定提交的结果,过程如{1,c}; c) 全部参与节点退出是 b 的特例,各个退出的参与节点在重启时,将根据本地恢复日志和请求节点确定提交的结果,过程如{1,c};3. 请求节点与参与节点都有退出: a) 若请求节点在发出 commit 消息之前退出,提交还未开始,参与节点无论是否有退出,实际情况均为{1,a},各个节点重启时如{1,c}; b) 若请求节点在发出 commit 消息之后,没有开始提交或部分提交,然后退出,且参
  • 79. 与节点保持如下三种状态: i. 存活参与节点未能收到 commit,并存活; ii. 存活参与节点未能收到 commit,并退出; iii. 收到 commit 消息的参与节点未开始提交时便退出; 则实际情况为{1,a},各个节点重启时如{1,c,i}; c) 若请求节点在发出 commit 消息之后,且提交完成,然后退出,其它参与节点保持 如下三种状态: i. 存活参与节点未能收到 commit,并存活; ii. 存活参与节点未能收到 commit,并退出; iii. 收到 commit 消息的参与节点未开始提交时便退出; 则实际情况为{2,a},各个节点重启时如{1,c,ii,1}; d) 若请求节点在发出 commit 消息之后,没有开始提交或部分提交,然后退出,且有 至少一个参与节点提交完成,则实际情况为{1,b},各个节点重启时如{1,c,ii, 1};4. 一个特殊的情况,场景极难构造,可能造成不一致,但若出现这种情况,则只能是 mnesia 自身的 bug: 这个场景难于构造在 d 步骤,参与节点的其它 mnesia 模块通常也会立即收到 mnesia_tm 引发的{EXIT, Pid, Reason}消息,并且立即退出。 a) 请求节点在发出 commit 消息之后,且提交完成; b) 参与节点的 commit_participant 进程在收到 commit 消息前; c) 参与节点的 mnesia_tm 先一步退出, commit_participant 进程会收到{EXIT, Pid, Reason}消息;
  • 80. d) 参与节点的其它 mnesia 模块还未收到 mnesia_tm 引发的{EXIT, Pid, Reason}消息, 仍在正常工作; e) 参与节点的 commit_participant 进程将决定中止事务,此时出现不一致; f) 再次访问 mnesia 时, mnesia 将抛出{inconsistent_database, bad_decision, Node}消息; g) 重启 mnesia 时协商恢复。3. 节点 down 异步检测1. 检测原理在 mnesia 没有处理任何事务的情况下,若此时 erlang 虚拟机检测到任何节点退出,mnesia需要进行网络分区检查,但是这个检查的流程有些特殊:mnesia_monitor.erlhandle_call(init, _From, State) -> net_kernel:monitor_nodes(true), EarlyNodes = State#state.early_connects, State2 = State#state{tm_started = true}, {reply, EarlyNodes, State2};启动时,mnesia_monitor 将监听节点 up/down 的消息。handle_info({nodedown, _Node}, State) -> %% Ignore, we are only caring about nodeups {noreply, State};mnesia 并不依赖于 nodedown 消息处理节点退出,而是在新节点加入时,通过 link 新节点上的 mnesia_monitor 进程进行节点退出检查:handle_info(Msg = {EXIT,Pid,_}, State) -> Node = node(Pid), if Node /= node(), State#state.connecting == undefined -> %% Remotly linked process died, assume that it was a mnesia_monitor mnesia_recover:mnesia_down(Node),
  • 81. mnesia_controller:mnesia_down(Node), {noreply, State#state{going_down = [Node | State#state.going_down]}}; Node /= node() -> {noreply, State#state{mq = State#state.mq ++ [{info, Msg}]}}; true -> %% We have probably got an exit signal from %% disk_log or dets Hint = "Hint: check that the disk still is writable", fatal("~p got unexpected info: ~p; ~p~n", [?MODULE, Msg, Hint]) end;handle_cast({mnesia_down, mnesia_controller, Node}, State) -> mnesia_tm:mnesia_down(Node), {noreply, State};handle_cast({mnesia_down, mnesia_tm, {Node, Pending}}, State) -> mnesia_locker:mnesia_down(Node, Pending), {noreply, State};handle_cast({mnesia_down, mnesia_locker, Node}, State) -> Down = {mnesia_down, Node}, mnesia_lib:report_system_event(Down), GoingDown = lists:delete(Node, State#state.going_down), State2 = State#state{going_down = GoingDown}, Pending = State#state.pending_negotiators, case lists:keysearch(Node, 1, Pending) of {value, {Node, Mon, ReplyTo, Reply}} -> %% Late reply to remote monitor link(Mon), %% link to remote Monitor gen_server:reply(ReplyTo, Reply), P2 = lists:keydelete(Node, 1,Pending), State3 = State2#state{pending_negotiators = P2}, process_q(State3); false -> %% No pending remote monitors process_q(State2) end;当发现有某个节点的 mnesia_monitor 进程退出时,这时候要依次通知 mnesia_recover、mnesia_controller、mnesia_tm、mnesia_locker、监听 mnesia_down 消息的进程,告知某个节点的 mnesia 退出,最后进行 mnesia_monitor 本身对 mnesia_down 消息的处理。这些处理阶段主要包括:1. mnesia_recover:运行时无动作,仅在初始化阶段对未决事务进行检查;
• 82. 2. mnesia_controller: a) 通知 mnesia_recover 记录 mnesia_down 历史事件,以便将来退出节点重新连接时进行分区检查; b) 修改所有与退出节点相关的表的几项动态节点属性(where_to_commit、where_to_write、where_to_wlock、active_replicas),以保持表的全局拓扑的正确性;3. mnesia_tm:重新配置退出节点参与的事务,若退出节点参与了本节点的 coordinator 组织的事务,则进行 coordinator 校正,通知 coordinator 一个 mnesia_down 消息,而 coordinator 根据提交阶段进行修正,选择中止事务或提交事务;若退出节点参与了本节点的 participant 参与的事务,则进行 participant 校正,告知其它节点该退出节点的状态;4. mnesia_locker:清除四张锁表(mnesia_lock_queue、mnesia_held_locks、mnesia_sticky_locks、mnesia_tid_locks)中与退出节点相关的锁;5. 上层应用:mnesia 向订阅了 mnesia_down 消息的应用投递该消息;2. mnesia_recover 处理 mnesia_down 消息mnesia_recover.erlmnesia_down(Node) -> case ?catch_val(recover_nodes) of {EXIT, _} -> %% Not started yet ignore; _ -> mnesia_lib:del(recover_nodes, Node), cast({mnesia_down, Node}) end.handle_cast({mnesia_down, Node}, State) -> case State#state.unclear_decision of undefined -> {noreply, State};
  • 83. D -> case lists:member(Node, D#decision.ram_nodes) of false -> {noreply, State}; true -> State2 = add_remote_decision(Node, D, State), {noreply, State2} end end;unclear_decision 仅用于启动时对之前的未决事务进行恢复,运行时期总为 undefined。3. mnesia_controller 处理 mnesia_down 消息mnesia_controller.erlmnesia_down(Node) -> case cast({mnesia_down, Node}) of {error, _} -> mnesia_monitor:mnesia_down(?SERVER_NAME, Node); _Pid -> ok end.handle_cast({mnesia_down, Node}, State) -> maybe_log_mnesia_down(Node), …maybe_log_mnesia_down(N) -> case mnesia_lib:is_running() of yes -> verbose("Logging mnesia_down ~w~n", [N]), mnesia_recover:log_mnesia_down(N), ok; … end.mnesia_controller 对 mnesia_down 的 处 理 过 程 中 , 会 通 知 mnesia_recover 记 录 这 个mnesia_down 事件,以备在该节点启动时进行额外的网络分区检查工作:mnesia_recover.erllog_mnesia_down(Node) -> call({log_mnesia_down, Node}).handle_call({log_mnesia_down, Node}, _From, State) -> do_log_mnesia_down(Node),
  • 84. {reply, ok, State};do_log_mnesia_down(Node) -> Yoyo = {mnesia_down, Node, Date = date(), Time = time()}, case mnesia_monitor:use_dir() of true -> mnesia_log:append(latest_log, Yoyo), disk_log:sync(latest_log); false -> ignore end, note_down(Node, Date, Time).note_down(Node, Date, Time) -> ?ets_insert(mnesia_decision, {mnesia_down, Node, Date, Time}).mnesia_recover 除了记录 mnesia_down 日志外,还会在 ets 表 mnesia_decision 中记录节点的down 历史记录,以备在该节点重新 up 时做检查,这样可以避免由于瞬断 ABA 问题导致的不一致。若不记录历史记录,则在如下情况发生时,将造成不一致:1. 集群由 A、B、C 组成2. A 由于网络问题,出现瞬断,B、C 完好3. B、C 写入数据,满足 majority 条件4. A 网络恢复,并未记录 B、C 出现 mnesia_down 的情况,仍然认为自己在集群中,此时 数据不一致mnesia_controller.erlhandle_cast({mnesia_down, Node}, State) -> … mnesia_lib:del({current, db_nodes}, Node), mnesia_lib:unset({node_up, Node}), mnesia_checkpoint:tm_mnesia_down(Node), Alltabs = val({schema, tables}), reconfigure_tables(Node, Alltabs), …reconfigure_tables(N, [Tab |Tail]) -> del_active_replica(Tab, N),
• 85. case val({Tab, where_to_read}) of N -> mnesia_lib:set_remote_where_to_read(Tab); _ -> ignore end, reconfigure_tables(N, Tail);reconfigure_tables(_, []) -> ok.del_active_replica(Tab, Node) -> Var = {Tab, where_to_commit}, {Blocked, Old} = is_tab_blocked(val(Var)), Del = lists:keydelete(Node, 1, Old), New = lists:sort(Del), set(Var, mark_blocked_tab(Blocked, New)), % where_to_commit mnesia_lib:del({Tab, active_replicas}, Node), mnesia_lib:del({Tab, where_to_write}, Node), update_where_to_wlock(Tab).update_where_to_wlock(Tab) -> WNodes = val({Tab, where_to_write}), Majority = case catch val({Tab, majority}) of true -> true; _ -> false end, set({Tab, where_to_wlock}, {WNodes, Majority}).除了记录历史 mnesia_down,mnesia_controller 还要:1.更新全局 db_nodes;2.更新表的各类节点信息,这些信息包括:where_to_commit,where_to_write,where_to_wlock,active_replicas;由于出现 mnesia_down 后,将会立即更新表的 active_replicas 属性,这也给如下策略提供了依据:通过监控 schema 表的 active_replicas,若发现其与配置的 schema 表的 disc 节点不符,且不在 schema 表的 active_replicas 中的 disc 节点却位于 erlang:nodes/0 的列表中,则 ping 这些不符的节点,若这些节点可以 ping 通,则认为曾经出现了集群分区,而检测 inconsistent_database 消息的进程错过了某个 inconsistent_database 消息,这时可以进行 mnesia 的重启和重新协商。
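上文“监控 schema 表的 active_replicas、并 ping 不在其中却网络可达的 disc 节点”的思路,可以用文档化的 mnesia:system_info(db_nodes) 与 mnesia:system_info(running_db_nodes) 近似实现。下面的 suspect_partitioned_nodes/0 仅为示意草图,并非 mnesia 接口:

suspect_partitioned_nodes() ->
    Configured = mnesia:system_info(db_nodes),          %% 已知/配置的集群节点
    Running    = mnesia:system_info(running_db_nodes),  %% mnesia 当前认为在运行的节点
    Missing    = Configured -- Running,
    %% 网络可达却不被 mnesia 视为运行中的节点,很可能经历过分区,
    %% 或者本节点错过了相应的 inconsistent_database 事件
    [N || N <- Missing, net_adm:ping(N) =:= pong].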
  • 86. 4. mnesia_tm 处理 mnesia_down 消息mnesia_down(Node) -> case whereis(?MODULE) of undefined -> mnesia_monitor:mnesia_down(?MODULE, {Node, []}); Pid -> Pid ! {mnesia_down, Node} end.doit_loop(#state{coordinators=Coordinators,participants=Participants,supervisor=Sup}=State) -> receive … {mnesia_down, N} -> verbose("Got mnesia_down from ~p, reconfiguring...~n", [N]), reconfigure_coordinators(N, gb_trees:to_list(Coordinators)), Tids = gb_trees:keys(Participants), reconfigure_participants(N, gb_trees:values(Participants)), NewState = clear_fixtable(N, State), mnesia_monitor:mnesia_down(?MODULE, {N, Tids}), doit_loop(NewState); … end.mnesia_tm 需要重新配置 coordinator 与 participant,coordinator 用于请求者,participant 用于充当三阶段提交的参与者。mnesia_tm 重设 coordinator:reconfigure_coordinators(N, [{Tid, [Store | _]} | Coordinators]) -> case mnesia_recover:outcome(Tid, unknown) of committed -> WaitingNodes = ?ets_lookup(Store, waiting_for_commit_ack), case lists:keymember(N, 2, WaitingNodes) of false -> ignore; % avoid spurious mnesia_down messages true -> send_mnesia_down(Tid, Store, N) end;
  • 87. aborted -> ignore; % avoid spurious mnesia_down messages _ -> %% Tell the coordinator about the mnesia_down send_mnesia_down(Tid, Store, N) end, reconfigure_coordinators(N, Coordinators);reconfigure_coordinators(_N, []) -> ok.send_mnesia_down(Tid, Store, Node) -> Msg = {mnesia_down, Node}, send_to_pids([Tid#tid.pid | get_elements(friends,Store)], Msg).reconfigure_coordinators 主要服务于请求者, mnesia_recover 保存了最近的事务的活动状态,对于未明结果的事务,需要向其请求者发送 mnesia_down 消息, 告知其某个参与者节点 down。若此时没有活动事务,节点也未参与任何活动事务,此时 mnesia_tm 不需要重设 coordinator与 participant。请求者处第一阶段提交对 mnesia_down 的处理:rec_all([Node | Tail], Tid, Res, Pids) -> receive … {mnesia_down, Node} -> %% Make sure that mnesia_tm knows it has died %% it may have been restarted Abort = {do_abort, {bad_commit, Node}}, catch {?MODULE, Node} ! {Tid, Abort}, rec_all(Tail, Tid, Abort, Pids) end;请求者处第二阶段提交对 mnesia_down 的处理:rec_acc_pre_commit([Pid | Tail], Tid, Store, Commit, Res, DumperMode, GoodPids, SchemaAckPids) -> receive … {mnesia_down, Node} when Node == node(Pid) -> AbortRes = {do_abort, {bad_commit, Node}}, catch Pid ! {Tid, AbortRes}, %% Tell him that he has died rec_acc_pre_commit(Tail, Tid, Store, Commit, AbortRes, DumperMode, GoodPids, SchemaAckPids)
  • 88. end;请求者处第三阶段提交对 mnesia_down 的处理:对于普通操作,第三阶段是异步的,这时已经不需要监控 mnesia_down 消息了。对于 schema 操作,第三阶段是同步的:sync_schema_commit(_Tid, _Store, []) -> ok;sync_schema_commit(Tid, Store, [Pid | Tail]) -> receive {?MODULE, _, {schema_commit, Tid, Pid}} -> ?ets_match_delete(Store, {waiting_for_commit_ack, node(Pid)}), sync_schema_commit(Tid, Store, Tail); {mnesia_down, Node} when Node == node(Pid) -> ?ets_match_delete(Store, {waiting_for_commit_ack, Node}), sync_schema_commit(Tid, Store, Tail) end.此处仍然参考三阶段提交,执行到这个位置时,请求者的 schema 已经更新完毕,而其它参与者即使 down 了,也可以在重启后恢复。mnesia_tm 重设 participant:reconfigure_participants(N, [P | Tail]) -> case lists:member(N, P#participant.disc_nodes) or lists:member(N, P#participant.ram_nodes) of false -> reconfigure_participants(N, Tail); true -> Tid = P#participant.tid, if node(Tid#tid.pid) /= N -> reconfigure_participants(N, Tail); true -> verbose("Coordinator ~p in transaction ~p died~n", [Tid#tid.pid, Tid]), Nodes = P#participant.disc_nodes ++ P#participant.ram_nodes, AliveNodes = Nodes -- [N], Protocol = P#participant.protocol, tell_outcome(Tid, Protocol, N, AliveNodes, AliveNodes), reconfigure_participants(N, Tail) end end;reconfigure_participants(_, []) -> [].
  • 89. tell_outcome(Tid, Protocol, Node, CheckNodes, TellNodes) -> Outcome = mnesia_recover:what_happened(Tid, Protocol, CheckNodes), case Outcome of aborted -> rpc:abcast(TellNodes, ?MODULE, {Tid,{do_abort, {mnesia_down, Node}}}); committed -> rpc:abcast(TellNodes, ?MODULE, {Tid, do_commit}) end, Outcome.mnesia_recover.erlwhat_happened(Tid, Protocol, Nodes) -> Default = case Protocol of asym_trans -> aborted; _ -> unclear %% sym_trans and sync_sym_trans end, This = node(), case lists:member(This, Nodes) of true -> {ok, Outcome} = call({what_happened, Default, Tid}), Others = Nodes -- [This], case filter_outcome(Outcome) of unclear -> what_happened_remotely(Tid, Default, Others); aborted -> aborted; committed -> committed end; false -> what_happened_remotely(Tid, Default, Nodes) end.handle_call({what_happened, Default, Tid}, _From, State) -> sync_trans_tid_serial(Tid), Outcome = outcome(Tid, Default), {reply, {ok, Outcome}, State};what_happened_remotely(Tid, Default, Nodes) -> {Replies, _} = multicall(Nodes, {what_happened, Default, Tid}), check_what_happened(Replies, 0, 0).check_what_happened([H | T], Aborts, Commits) -> case H of {ok, R} -> case filter_outcome(R) of committed -> check_what_happened(T, Aborts, Commits + 1); aborted ->
  • 90. check_what_happened(T, Aborts + 1, Commits); unclear -> check_what_happened(T, Aborts, Commits) end; {error, _} -> check_what_happened(T, Aborts, Commits); {badrpc, _} -> check_what_happened(T, Aborts, Commits) end;check_what_happened([], Aborts, Commits) -> if Aborts == 0, Commits == 0 -> aborted; % None of the active nodes knows Aborts > 0 -> aborted; % Someody has aborted Aborts == 0, Commits > 0 -> committed % All has committed end.首先询问本地 mnesia_recover,检查其是否保存着事务决议结果,若 mnesia_recover 有决议结果,则使用 mnesia_recover 的决议结果;否则询问其它参与节点决议结果。向其它存活的参与节点询问事务决议结果,可以看出,策略如下:若其他节点中有任何节点abort 了,则事务结果为 abort;若其他节点中没有节点 abort 或 commit,则事务结果为 abort;若其他节点中没有节点 abort 且至少有一个节点 commit,则事务结果为 commit。对于同时出现 abort 和 commit 的情况,mnesia 选择 abort,而在 commit 的参与节点处,由于 abort 的节点会询问其结果,commit 节点发现其与自己的事务结果冲突,会向上报告{inconsistent_database, bad_decision, Node}消息,这需要应用进行数据订正。reconfigure_participants 主要服务于参与者,若参与者得知,其请求者 down,则需要决议事务结果,并向其它节点广播自身发现的结果。参与者处第一、二阶段提交对 mnesia_down 的处理:由于此时没有提交,可以直接令事务中止。参与者处第三阶段提交对 mnesia_down 的处理,且没有任何参与节点开始提交:
  • 91. 由于此时没有提交,可以直接令事务中止。参与者处第三阶段提交对 mnesia_down 的处理,且有某些参与节点已经提交:由于此时有部分提交,则直接从提交节点处得到事务结果,本节点进行提交。5. mnesia_locker 处理 mnesia_down 消息mnesia_locker.erlmnesia_down(N, Pending) -> case whereis(?MODULE) of undefined -> mnesia_monitor:mnesia_down(?MODULE, N); Pid -> Pid ! {release_remote_non_pending, N, Pending} end.loop(State) -> receive … {release_remote_non_pending, Node, Pending} -> release_remote_non_pending(Node, Pending), mnesia_monitor:mnesia_down(?MODULE, Node), loop(State); … end.release_remote_non_pending(Node, Pending) -> ?ets_match_delete(mnesia_sticky_locks, {_ , Node}), AllTids = ?ets_match(mnesia_tid_locks, {$1, _, _}), Tids = [T || [T] <- AllTids, Node == node(T#tid.pid), not lists:member(T, Pending)], do_release_tids(Tids).do_release_tids([Tid | Tids]) -> do_release_tid(Tid), do_release_tids(Tids);do_release_tids([]) -> ok.do_release_tid(Tid) -> Locks = ?ets_lookup(mnesia_tid_locks, Tid), ?dbg("Release ~p ~p ~n", [Tid, Locks]), ?ets_delete(mnesia_tid_locks, Tid), release_locks(Locks), UniqueLocks = keyunique(lists:sort(Locks),[]),
• 92. rearrange_queue(UniqueLocks).release_locks([Lock | Locks]) -> release_lock(Lock), release_locks(Locks);release_locks([]) -> ok.release_lock({Tid, Oid, {queued, _}}) -> ?ets_match_delete(mnesia_lock_queue, #queue{oid=Oid, tid = Tid, op = _, pid = _, lucky = _});release_lock({_Tid, Oid, write}) -> ?ets_delete(mnesia_held_locks, Oid);release_lock({Tid, Oid, read}) -> case ?ets_lookup(mnesia_held_locks, Oid) of [{Oid, Prev, Locks0}] -> case remove_tid(Locks0, Tid, []) of [] -> ?ets_delete(mnesia_held_locks, Oid); Locks -> ?ets_insert(mnesia_held_locks, {Oid, Prev, Locks}) end; [] -> ok end.mnesia_locker 所作的工作就比较直接,即为清除四张锁表(mnesia_lock_queue、mnesia_held_locks、mnesia_sticky_locks、mnesia_tid_locks)中与退出节点相关的锁。4. 节点 up 异步检测当退出的节点重新加入时,mnesia 将进行网络分区检查:mnesia_monitor.erlhandle_info({nodeup, Node}, State) -> HasDown = mnesia_recover:has_mnesia_down(Node), ImRunning = mnesia_lib:is_running(), if %% If Im not running the test will be made later. HasDown == true, ImRunning == yes -> spawn_link(?MODULE, detect_partitioned_network, [self(), Node]); true -> ignore end, {noreply, State};
• 93. 网络分区和不一致的检查过程,将延迟到有新节点加入集群时,若该新节点曾经被本节点认为是 mnesia_down 的,则进行真正的检查过程。mnesia_recover.erlhas_mnesia_down(Node) -> case ?ets_lookup(mnesia_decision, Node) of [{mnesia_down, Node, _Date, _Time}] -> true; [] -> false end.从 ets 表 mnesia_decision 中取回节点的历史记录。mnesia_monitor.erldetect_partitioned_network(Mon, Node) -> detect_inconcistency([Node], running_partitioned_network), unlink(Mon), exit(normal).detect_inconcistency([], _Context) -> ok;detect_inconcistency(Nodes, Context) -> Downs = [N || N <- Nodes, mnesia_recover:has_mnesia_down(N)], {Replies, _BadNodes} = rpc:multicall(Downs, ?MODULE, has_remote_mnesia_down, [node()]), report_inconsistency(Replies, Context, ok).has_remote_mnesia_down(Node) -> HasDown = mnesia_recover:has_mnesia_down(Node), Master = mnesia_recover:get_master_nodes(schema), if HasDown == true, Master == [] -> {true, node()}; true -> {false, node()} end.本节点的检查过程,需要首先向新节点询问,在新节点的拓扑视图中,本节点是否也曾经出现过 down 的情况。新节点也将检查自己的历史记录,查看本节点是否曾经 down 过,并返回检查结果。注意,若配置了 master 节点选项,则可以通过 master 节点进行仲裁,可以不被认为 down 过。本节点收到结果后,进行实际的分区检查:
  • 94. report_inconsistency([{true, Node} | Replies], Context, _Status) -> Msg = {inconsistent_database, Context, Node}, mnesia_lib:report_system_event(Msg), report_inconsistency(Replies, Context, inconsistent_database);…若新加入节点认为本节点曾经 down 过,而此时本节点也认为新节点也 down 过,此时 mnesia存在潜在的不一致状态,此时必须通知应用,报告这个不一致消息,此时 Context 为running_partitioned_network,这也意味着 mnesia 是在运行过程中发现的网络分区。除了运行时通过 nodeup 消息对分区进行检查,还需要启动时对分区进行检查,否则在网络分区出现后,本节点关闭再重启,mnesia_down 临时历史记录消失(在日志中仍然记录),无法通过 nodeup 消息进行分区检查。mnesia_recover.erlconnect_nodes(Ns) -> call({connect_nodes, Ns}).handle_call({connect_nodes, Ns}, From, State) -> AlreadyConnected = val(recover_nodes), {_, Nodes} = mnesia_lib:search_delete(node(), Ns), Check = Nodes -- AlreadyConnected, case mnesia_monitor:negotiate_protocol(Check) of busy -> erlang:send_after(2, self(), {connect_nodes,Ns,From}), {noreply, State}; [] -> gen_server:reply(From, {[], AlreadyConnected}), {noreply, State}; GoodNodes -> mnesia_lib:add_list(recover_nodes, GoodNodes), cast({announce_all, GoodNodes}), case get_master_nodes(schema) of [] -> Context = starting_partitioned_network, mnesia_monitor:detect_inconcistency(GoodNodes, Context); _ -> %% If master_nodes is set ignore old inconsistencies ignore end, gen_server:reply(From, {GoodNodes, AlreadyConnected}), {noreply,State}
• 95. end;启动过程中,需要与其它节点进行连接,然后进行协议版本协商,若版本兼容,之后将可以与该节点进行交互。若没有 master 节点配置,则将进行分区检查,检查方法仍然使用 mnesia_monitor:detect_inconcistency 进行,询问远程节点,其 mnesia_down 历史记录中是否记载了本节点,若有,则存在潜在的不一致,此时的上下文为 starting_partitioned_network。8. 其它
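作为补充:前文多处提到,出现 {inconsistent_database, Context, Node} 事件时需要由应用自行决策、订正数据。mnesia:subscribe(system) 与 mnesia:set_master_nodes/1 均为真实接口,下面给出一个假设性的处理骨架,其中 pick_winner/1 代表应用自定义的仲裁策略(纯属示意),整个流程只是一种可能的做法:

%% 订阅 mnesia 系统事件,在收到不一致告警时选定一个权威节点,
%% 通过设置 master 节点并重启 mnesia,让本节点放弃本地数据、向权威节点看齐
handle_partition_events() ->
    {ok, _} = mnesia:subscribe(system),
    await_event().

await_event() ->
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            error_logger:warning_msg("mnesia partition: ~p (from ~p)~n",
                                     [Context, Node]),
            Winner = pick_winner([node(), Node]),   %% 由应用决定以哪个分区为准(假设)
            mnesia:set_master_nodes([Winner]),
            mnesia:stop(),
            mnesia:start(),
            handle_partition_events();              %% mnesia 重启后需重新订阅
        _Other ->
            await_event()
    end.

%% 占位实现:实际仲裁应结合业务数据(如写入量、时间戳)决定
pick_winner(Nodes) ->
    hd(lists:sort(Nodes)).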