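Topological Analysis of Metabolic Networks and Its Application to Mitochondrial Evolution
Yu Yiming (俞一明), School of Life Sciences and Biotechnology, Shanghai Jiao Tong University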
A STUDY OF MITOCHONDRIAL EVOLUTION BASED ON
TOPOLOGICAL ANALYSIS OF METABOLIC NETWORKS
ABSTRACT
According to the endosymbiotic theory, mitochondria descend from an alpha-proteobacterial
endosymbiont that came to reside within a nucleus-containing (but amitochondriate) host cell.
Single-gene phylogenies (especially SSU rRNA-based ones) have established many of the
currently accepted affiliations among and between eubacterial, mitochondrial, and nuclear
genomes; however, the resolving power of single-gene analyses is limited by the inherently small
information content of individual genes, complicated in the particular case of mitochondria by
extreme differences in base composition. We therefore carry out a global analysis of the
metabolic networks of mitochondria and related species in the framework of complex network
theory and systems biology, in the hope of gaining a better understanding of the endosymbiotic
process.
Recently, there has been much progress in understanding how the structure of a network affects
its function. Different networks can share the same topological properties: the small-world
property, for example, has been found in social networks, the WWW, and biological
networks. In this thesis, I reconstructed 29 enzyme interaction networks and found that similarity
in topological properties is broadly consistent with phylogenetic relatedness. This supports the
view that structural similarity reflects functional similarity.
Modularity (the division of a network into sub-networks that are only weakly connected to one
another) has been discussed in many studies. In this work, simulated annealing is used to find
modules, and each enzyme in a module is then mapped to its functional categories. Comparing the
modules of mitochondria with those of rpr and aph shows that mitochondria have evolved a more
specialized metabolism within each module and, at the same time, have lost many or nearly all
reactions in some categories, such as biosynthesis of secondary metabolites, which is quite
different from rpr and aph. This implies that the functional difference between mitochondria and
rpr is much larger than their structural similarity would suggest.
Key words: metabolic network, mitochondria, endosymbiosis, modularity, network evolution
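related to one another while having relatively few connections to nodes in other modules. To
describe this principle quantitatively, Newman introduced the concept of network modularity.
For a network that has been partitioned into modules, the modularity M is defined as [46]:
\[ M = \sum_{s=1}^{r} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right] \]
where r is the number of modules, L is the total number of links in the network, l_s is the
number of links between nodes within module s, and d_s is the sum of the degrees of the nodes in
module s. Under this definition 0 ≤ M < 1; the higher the value of M, the more pronounced the
modular structure of the network, and M typically lies between 0.3 and 0.7 [43]. When nodes are
assigned to modules at random, M = 0.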
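As an illustration of the definition above, the following minimal Python sketch computes M for a given partition of an undirected network. The function name `modularity`, the edge-list input format and the toy network are assumptions made for this example only; they are not taken from the thesis.

```python
from collections import defaultdict


def modularity(edges, module_of):
    """M = sum_s [ l_s/L - (d_s/(2L))^2 ] for an undirected network.

    edges: list of (u, v) pairs, each undirected link listed once.
    module_of: dict mapping every node to its module label.
    """
    L = len(edges)
    if L == 0:
        return 0.0
    links_inside = defaultdict(int)  # l_s: links with both ends in module s
    degree_sum = defaultdict(int)    # d_s: sum of degrees of the nodes in s
    for u, v in edges:
        degree_sum[module_of[u]] += 1
        degree_sum[module_of[v]] += 1
        if module_of[u] == module_of[v]:
            links_inside[module_of[u]] += 1
    return sum(links_inside[s] / L - (degree_sum[s] / (2 * L)) ** 2
               for s in degree_sum)


# Hypothetical toy network: two triangles joined by a single bridge link.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_modules = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_module = {node: "a" for node in range(1, 7)}

print(modularity(edges, two_modules))  # one triangle per module
print(modularity(edges, one_module))   # everything in a single module
```

Placing each triangle in its own module gives M = 5/14, whereas putting all nodes into one module gives M = 0, in line with the remark that a trivial partition has no modular structure.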
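Simulated annealing is a combinatorial optimization algorithm that mimics the annealing of a
solid by iterating the Metropolis algorithm. A solution i of the combinatorial optimization
problem and its objective function f(i) are treated as equivalent to a microstate i of the solid
and its energy E_i, and a control parameter t plays the role of the temperature T in the
annealing process. For each value of t, the algorithm repeats the cycle "generate a new solution,
evaluate it, accept or discard it", which corresponds to the solid approaching thermal
equilibrium at a fixed temperature; the value of t is decreased as the algorithm proceeds.
Starting from an initial solution, and after a large number of solution transformations,
simulated annealing yields a relatively optimal solution of the problem for the given value of
the control parameter. The value of t is then reduced and the procedure repeated, so that the
global optimum of the combinatorial optimization problem is finally obtained as t tends to
zero [47].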
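The basic procedure of simulated annealing is:
1) Initialization: choose an initial temperature T (sufficiently large), an initial solution S
(the starting point of the iteration), and the number of iterations L for each value of T;
2) For k = 1, ..., L, repeat steps 3) to 6);
3) Generate a new solution S′;
4) Compute the increment Δt′ = C(S′) - C(S), where C(S) is the objective function;
5) If Δt′ < 0, accept S′ as the new current solution; otherwise accept S′ as the new current
solution with probability exp(-Δt′/T) (the Metropolis criterion);
6) If the termination condition is met, output the current solution as the optimal solution and
stop. The termination condition is usually that several consecutive new solutions have all been
rejected;
7) Gradually decrease T, keeping T > 0, and return to step 2).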
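The steps above can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the implementation used in the thesis: the names `simulated_annealing`, `cost` and `neighbour`, the parameter values and the toy objective are hypothetical, and the loop stops when the temperature falls below a floor `t_min` rather than after a run of rejected solutions.

```python
import math
import random


def simulated_annealing(cost, neighbour, start, t_start=10.0, t_min=1e-3,
                        cooling=0.95, steps_per_t=100, seed=0):
    """Minimise `cost` with the Metropolis acceptance rule (generic sketch)."""
    rng = random.Random(seed)
    current = best = start
    t = t_start
    while t > t_min:                      # step 7: lower T while T > 0
        for _ in range(steps_per_t):      # step 2: L iterations per temperature
            candidate = neighbour(current, rng)           # step 3
            delta = cost(candidate) - cost(current)       # step 4
            # step 5: always accept improvements, otherwise accept with
            # probability exp(-delta / T)
            if delta < 0 or rng.random() < math.exp(-delta / t):
                current = candidate
                if cost(current) < cost(best):
                    best = current        # remember the best solution seen
        t *= cooling
    return best


# Hypothetical toy problem: find the integer that minimises (x - 7)^2.
def cost(x):
    return (x - 7) ** 2


def neighbour(x, rng):
    return x + rng.choice((-1, 1))


start = 50
best = simulated_annealing(cost, neighbour, start)
print(best, cost(best))
```

Keeping track of the best solution seen so far is a design choice of this sketch; it guarantees that the returned solution is never worse than the starting point.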
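Guimerà and Amaral proposed an algorithm that uses simulated annealing to find the partition
into modules that maximizes the network modularity M, i.e. the simulated annealing clustering
algorithm [15,16]. The algorithm takes the negative of the modularity, -M, as the objective
function of simulated annealing, and each new state is accepted with a probability given by the
Metropolis criterion:
\[ p = \begin{cases} 1 & \text{if } C_f \le C_i \\ \exp\left( -\dfrac{C_f - C_i}{T} \right) & \text{if } C_f > C_i \end{cases} \]
where C_f is the value of the objective function after the update and C_i is its value before
the update.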
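At each temperature T, the module structure of the network is changed by two kinds of random
moves: 1) moving a single node at random from one module to another, performed n_i times with
n_i = fω²; 2) merging two modules at random, or splitting a module at random into two, performed
n_c times with n_c = fω. Here ω is the number of nodes in the whole network and f is the
iteration factor. The temperature is changed according to a cooling factor Δ, T′ = ΔT, typically
with Δ ∈ [0.990, 0.999]. The algorithm is considered to have converged when the modularity M
remains unchanged over 25 consecutive temperature changes, and the converged state is taken as
the final clustering result [15,16].
In this work, the clustering obtained with iteration factor f = 1 and cooling factor Δ = 0.99 is
taken as the result of the simulated annealing clustering algorithm. Taking mitochondria as an
example, the modular decomposition is given in Table 4-1, with M = 0.42605.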
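The acceptance rule and the schedule described above can be written down compactly. The Python sketch below only illustrates those formulas; the function names are hypothetical, and the move generation itself (moving nodes, merging and splitting modules) is deliberately left out.

```python
import math


def acceptance_probability(c_i, c_f, temperature):
    """Metropolis criterion with cost C = -M (c_i before, c_f after the move)."""
    if c_f <= c_i:
        return 1.0
    return math.exp(-(c_f - c_i) / temperature)


def moves_per_temperature(n_nodes, f=1.0):
    """Return (n_i, n_c): single-node moves and merge/split moves per temperature."""
    n_individual = int(f * n_nodes ** 2)   # n_i = f * omega^2
    n_collective = int(f * n_nodes)        # n_c = f * omega
    return n_individual, n_collective


def cooled(temperature, factor=0.99):
    """One cooling step T' = factor * T; 0.99 is the value chosen in the text."""
    return factor * temperature


# A move that raises M from 0.40 to 0.42 lowers the cost and is always accepted.
print(acceptance_probability(-0.40, -0.42, 0.01))
# The reverse move is accepted only with probability exp(-0.02 / T).
print(acceptance_probability(-0.42, -0.40, 0.02))
print(moves_per_temperature(100))
```

With f = 1, a network of 100 nodes therefore receives 10000 single-node moves and 100 merge or split moves at every temperature.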
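the second digit (7) denotes the subclass (phosphotransferases), the third digit (1) denotes the
sub-subclass (phosphotransferases with a hydroxyl group as acceptor), and the fourth digit (1) is
the serial number of the enzyme within its sub-subclass (D-glucose as the acceptor of the
phosphate group). Enzymes with similar EC numbers have similar functions; in particular, two
enzymes that differ only in the last digit are very similar. If we identified only the enzymes
that are exactly identical between modules, we would ignore this important information, so we
define a more reasonable measure of enzyme similarity based on the EC hierarchy. We treat the EC
number of each enzyme as a vector of four components and assign the components the weights 0.1,
0.2, 0.3 and 0.4 according to their level in the EC hierarchy. For two EC numbers, we use a
vector P to describe their agreement and difference: if they are identical at level k, P_k is
defined as 1, otherwise P_k is 0. The similarity of enzymes i and j is defined as:
\[ S_{ij} = \sum_{k=1}^{4} w_k P_k \]
For example, for the enzymes 1.1.1.2 and 1.1.3.1 the similarity S is:
\[ S = 0.1 \times 1 + 0.2 \times 1 + 0.3 \times 0 + 0.4 \times 0 = 0.3 \]
Note that two EC numbers must be compared from the highest level down to the lowest: if level k
differs, then P_t remains 0 for every level t ≥ k, even if level t itself is identical. For
example, for the enzymes 1.1.1.2 and 1.3.1.2 the similarity is S = 0.1 × 1 = 0.1.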
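The weighting scheme and the top-down comparison rule can be checked with a few lines of Python. This is a minimal sketch; the function name `ec_similarity` and the string representation of EC numbers are assumptions of the example.

```python
WEIGHTS = (0.1, 0.2, 0.3, 0.4)  # weights of the four EC levels, as in the text


def ec_similarity(ec_a, ec_b, weights=WEIGHTS):
    """S_ij = sum_k w_k * P_k, comparing levels from the highest to the lowest.

    Once one level differs, all lower levels count as different (P_t = 0).
    """
    score = 0.0
    for w, a, b in zip(weights, ec_a.split("."), ec_b.split(".")):
        if a != b:
            break
        score += w
    return score


print(round(ec_similarity("1.1.1.2", "1.1.3.1"), 10))  # first worked example
print(round(ec_similarity("1.1.1.2", "1.3.1.2"), 10))  # second worked example
```

The two calls reproduce the worked examples, giving 0.3 and 0.1 respectively; rounding is used only to hide floating-point noise.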
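Once the similarity between every pair of enzymes in two modules a and b has been obtained, for
each enzyme in module a we take the maximum of its similarities to the enzymes in b, denoted
Sbest. The similarity Simi_ab between modules a and b is then:
\[ Simi_{ab} = \frac{1}{N_a} \sum_{i=1}^{N_a} Sbest_i \]
where N_a is the number of enzymes in module a.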
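A short Python sketch of this module-level similarity follows. The helper `ec_similarity` repeats the enzyme similarity defined earlier so that the block is self-contained; the function names and the two toy modules are hypothetical.

```python
WEIGHTS = (0.1, 0.2, 0.3, 0.4)


def ec_similarity(ec_a, ec_b, weights=WEIGHTS):
    """Hierarchical EC similarity: stop at the first level that differs."""
    score = 0.0
    for w, a, b in zip(weights, ec_a.split("."), ec_b.split(".")):
        if a != b:
            break
        score += w
    return score


def module_similarity(module_a, module_b):
    """Simi_ab: mean over the enzymes of a of their best match in b (Sbest)."""
    best_matches = [max(ec_similarity(a, b) for b in module_b) for a in module_a]
    return sum(best_matches) / len(module_a)


# Hypothetical modules, each given as a list of EC numbers.
module_a = ["1.1.1.2"]
module_b = ["1.1.1.2", "2.7.1.1"]

print(module_similarity(module_a, module_b))  # every enzyme of a is found in b
print(module_similarity(module_b, module_a))  # 2.7.1.1 has no counterpart in a
```

The example shows that the measure is not symmetric: Simi_ab is 1 here while Simi_ba is 0.5, because the average is taken over the enzymes of the first module only.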
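Once the similarities between the individual modules of two species have been obtained, the
similarity of the overall modular structure of the two species needs to be analysed. Each
species is regarded as a class and each module of that species as an element of the class, so
that computing the similarity between two species is equivalent to computing the similarity
between two classes.
Here we introduce the Hausdorff metric, which measures the distance between non-empty point
sets in a metric space [49].
Let X be a metric space and \delta_X its metric. For a given point x ∈ X and a non-empty set
A ⊆ X, the distance from x to A is first defined as:
\[ \delta_H(x, A) := \inf_{a \in A} \delta_X(x, a) \]
Then, for any two non-empty sets A, B ⊆ X, the Hausdorff distance between A and B is defined as:
\[ \delta_H(A, B) := \max\bigl( \delta_{asym}(A, B),\ \delta_{asym}(B, A) \bigr) \]
where
\[ \delta_{asym}(A, B) := \sup_{a \in A} \delta_H(a, B) \]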
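For finite point sets the infimum and supremum become a minimum and a maximum, so the definition can be tried out directly. The Python sketch below assumes finite sets of points in the plane with the Euclidean metric; the function names and sample points are hypothetical.

```python
import math


def point_to_set(x, A, metric):
    """Distance from point x to the finite set A: min over a in A of metric(x, a)."""
    return min(metric(x, a) for a in A)


def asymmetric_distance(A, B, metric):
    """Largest distance from a point of A to the set B."""
    return max(point_to_set(a, B, metric) for a in A)


def hausdorff(A, B, metric=math.dist):
    """Hausdorff distance: the larger of the two asymmetric distances."""
    return max(asymmetric_distance(A, B, metric),
               asymmetric_distance(B, A, metric))


A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0)]

print(asymmetric_distance(A, B, math.dist))  # every point of A is close to B
print(asymmetric_distance(B, A, math.dist))  # (4, 0) is far from A
print(hausdorff(A, B))
```

In this example the two asymmetric distances are 1 and 3, so the Hausdorff distance is 3; taking the maximum is what makes the measure symmetric.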
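Based on the Hausdorff metric, we define the similarity of modular structure between species as
follows. Let C1 and C2 denote two classes (i.e. two species), let S_species(C1, C2) be the
similarity of modular structure between the species, and let a and b be elements of C1 and C2
respectively (i.e. modules of the species). We first define the distance S(a, C2) from an
element a of C1 to C2: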
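\[ S(a, C_2) = \max_{b \in C_2} [Simi(a, b)] \]
The similarity S(C1, C2) between C1 and C2 is then defined as:
\[ S(C_1, C_2) = \min_{a \in C_1} [S(a, C_2)] \]
S(C1, C2) is in general not symmetric. Following the definition of S(C1, C2), we further define
the similarity S′(C2, C1) between C2 and C1:
\[ S'(C_2, C_1) = \min_{b \in C_2} \{ \max_{a \in C_1} [Simi(b, a)] \} \]
From these definitions, the similarity of the overall modular structure of two species,
S_species(C1, C2), is defined as:
\[ S_{species}(C_1, C_2) = \min[ S(C_1, C_2),\ S'(C_2, C_1) ] \]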
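The min-max construction above can be made concrete with a small Python sketch. The module names and the similarity values in `table` are invented for illustration only, and `species_similarity` is a hypothetical function name; any module similarity function Simi could be passed in its place.

```python
def species_similarity(c1, c2, simi):
    """S_species(C1, C2) = min[S(C1, C2), S'(C2, C1)] for two lists of modules."""
    # S(C1, C2): worst case over modules a of C1 of their best match in C2
    s_12 = min(max(simi(a, b) for b in c2) for a in c1)
    # S'(C2, C1): the same construction starting from the modules of C2
    s_21 = min(max(simi(b, a) for a in c1) for b in c2)
    return min(s_12, s_21)


# Hypothetical module similarities (not symmetric, like Simi in the text).
table = {
    ("a1", "b1"): 0.8, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.3, ("a2", "b2"): 0.6,
    ("b1", "a1"): 0.7, ("b1", "a2"): 0.3,
    ("b2", "a1"): 0.1, ("b2", "a2"): 0.5,
}


def simi(x, y):
    return table[(x, y)]


species_1 = ["a1", "a2"]
species_2 = ["b1", "b2"]
print(species_similarity(species_1, species_2, simi))
```

Here S(C1, C2) = min(0.8, 0.6) = 0.6 and S′(C2, C1) = min(0.7, 0.5) = 0.5, so the species similarity is 0.5: the result is governed by the module that is matched worst in either direction.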
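4.2.2 Comparison of the global modular structure of the metabolic networks of different species
Based on the comparison of the global topological properties of the networks of 28 species in
the previous chapter, we selected the following species for modular structure analysis: rpr,
mge, mcp, mpn and bbu, whose topological properties are closest to those of mitochondria; ama,
aph, ecn, ech, erg, erw and eru, which are also rickettsias; and eco from the eubacteria, sce
from the eukaryotes and mja from the archaea.
Using the similarity measure introduced in the previous section, we computed the pairwise
similarity of modular structure between these species, as shown in Table 4-3. The species were
then clustered from the resulting distance matrix by hierarchical clustering; the result is
shown in Figure 4-1.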
Table 4-3 Similarity of overall modular structure between species
      ama   aph   bbu   ech   ecn   eco   erg   eru   erw   mcp   mge   mit   mja   mpn   rpr   sce
ama   1.00  0.52  0.35  0.45  0.39  0.32  0.42  0.65  0.42  0.21  0.20  0.33  0.31  0.20  0.45  0.32
aph         1.00  0.31  0.55  0.38  0.31  0.27  0.49  0.27  0.22  0.20  0.30  0.29  0.20  0.35  0.31
bbu               1.00  0.40  0.41  0.32  0.26  0.36  0.26  0.42  0.35  0.24  0.10  0.44  0.29  0.32
ech                     1.00  0.38  0.35  0.46  0.46  0.40  0.21  0.19  0.31  0.32  0.21  0.35  0.40
ecn                           1.00  0.43  0.38  0.46  0.38  0.26  0.21  0.32  0.40  0.22  0.36  0.44
eco                                 1.00  0.31  0.37  0.33  0.28  0.20  0.30  0.40  0.23  0.28  0.55
erg                                       1.00  0.77  0.74  0.23  0.17  0.31  0.32  0.17  0.48  0.34
eru                                             1.00  0.71  0.24  0.21  0.32  0.32  0.24  0.41  0.36
erw                                                   1.00  0.23  0.17  0.30  0.33  0.24  0.33  0.33
mcp                                                         1.00  0.48  0.22  0.10  0.48  0.24  0.21
mge                                                               1.00  0.26  0.10  0.64  0.18  0.15
mit                                                                     1.00  0.24  0.31  0.27  0.33
mja                                                                           1.00  0.10  0.28  0.27
mpn                                                                                 1.00  0.18  0.20
rpr                                                                                       1.00  0.33
sce                                                                                             1.00
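The clustering step that turns such a similarity table into a tree can be sketched as follows. The thesis does not state which linkage rule or which conversion from similarity to distance was used, so this Python example assumes distance = 1 - similarity and average linkage; the function name is hypothetical, and only four species from Table 4-3 are used to keep the example short.

```python
def average_linkage(labels, similarity):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance).

    similarity: dict keyed by frozenset({a, b}) with values in [0, 1].
    Assumption of this sketch: distance = 1 - similarity, average linkage.
    """
    def dist(a, b):
        return 1.0 - similarity[frozenset((a, b))]

    clusters = [(name,) for name in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist(a, b) for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Pairwise similarities of four species, copied from Table 4-3.
labels = ["erg", "eru", "mge", "mpn"]
similarity = {
    frozenset(("erg", "eru")): 0.77, frozenset(("erg", "mge")): 0.17,
    frozenset(("erg", "mpn")): 0.17, frozenset(("eru", "mge")): 0.21,
    frozenset(("eru", "mpn")): 0.24, frozenset(("mge", "mpn")): 0.64,
}

merges = average_linkage(labels, similarity)
for left, right, d in merges:
    print(left, right, round(d, 4))
```

On this subset, erg and eru are joined first, then mge and mpn, and the two pairs are joined last, which matches the high within-pair similarities in the table.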
References
[1] Erdős P, Rényi A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad.
Sci., 1960, 5: 17-60
[2] Watts DJ, Strogatz SH. Collective dynamics of 'small-world' networks. Nature, 1998,
393(6684): 440-442
[3] Barabási AL, Albert R. Emergence of scaling in random networks.
Science, 1999, 286(5439): 509-512
[4] Kitano H. Computational systems biology. Nature, 2002, 420: 206-210
[5] Yang SL. Advances in systems biology research (in Chinese). Bulletin of the Chinese Academy of Sciences, 2004, 19(1): 31-34
[6] Ideker T. Systems biology-what you need to know. Nature Biotech, 2004, 22: 473-475
[7] Wu JR. Aspects of systems biology (in Chinese). Science (Kexue), 2002, 45(6): 21-24
[8] Jeong H, Tombor B, Albert R, et al. The large-scale organization of metabolic networks.
Nature, 2000, 407: 651-654
[9] Wagner A and Fell DA. The small world inside large metabolic networks. Proc. R. Soc.
Lond. B, 2001, 268: 1803-1810
[10] Ma HW and Zeng A-P. Reconstruction of metabolic networks from genome data and
analysis of their global structure for various organisms. Bioinformatics, 2003, 19: 270-277
[11] Hartwell LH, Hopfield JJ, Leibler S, et al. From molecular to modular cell biology. Nature,
1999, 402: C47–C52
[12] Ravasz E, Somera AL, Mongru DA, et al. Hierarchical organization of modularity in
metabolic networks. Science, 2002, 297: 1551–1555
[13] Rives AW and Galitski T. Modular organization of cellular networks. Proc. Natl. Acad. Sci.
U. S. A. 2003, 100: 1128–1133
[14] Papin JA, Reed JL and Palsson BO. Hierarchical thinking in network biology: the unbiased
modularization of biochemical networks. Trends Biochem. Sci. 2004, 29: 641–647
[15] Guimerà R and Amaral LAN. Functional cartography of complex metabolic networks,
Nature, 2005a, 433: 895-900
[16] Guimerà R and Amaral LAN. Cartography of complex networks: Modules and universal
roles. J. Stat. Mech. Theor. Exp. 2005b, P02001, 1-13
[17] Ma HW and Zeng A-P. Reconstruction of metabolic networks from genome data and
analysis of their global structure for various organisms. Bioinformatics, 2003, 19: 270-277
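[18] Wang XC. Biochemistry (in Chinese). Beijing: Tsinghua University Press, 2001, 197-211
[19] Ding MX, Wang XZ, Wang YC, et al. Cell Biology (in Chinese). Beijing: Higher Education Press, 1995, 159-170;
374-383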
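[20] Kuang TY, Ma KP, Bai KZ. Prospects for research and development of biomass energy (in Chinese). Bulletin of National Natural Science Foundation of China, 2005(6): 326-330
[21] Gray MW, Burger G, Lang BF. Mitochondrial evolution. Science, 1999, 283(5407): 1476-1481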
[22] Eisen JA. Horizontal gene transfer among microbial genomes: new insights from complete
genome analysis. Curr. Opin. Genet. Dev. 2000, 10: 606-611
[23] Aravind L, Tatusov RL, Wolf YI, Walker DR, Koonin EV. Evidence for massive gene
exchange between archaeal and bacterial hyperthermophiles. Trends Genet. 1998, 14:
442-444
[24] Garcia-Vallve S, Romeu A, Palau J. Horizontal gene transfer in bacterial and archaeal
complete genomes. Genome Res. 2000, 10: 1719–1725
[25] Hedges SB. The origin and evolution of model organisms. Nat. Rev. Genet. 2002, 3:
838-849
[26] Martin W. Mosaic bacterial chromosomes: a challenge on route to a tree of genomes.
Bioessays, 1999, 21: 99–104
[27] Woese CR. Interpreting the universal phylogenetic tree. Proc. Natl. Acad. Sci. U. S. A.
2000, 97(15): 8392–8396
[28] Woese CR. On the evolution of cells. Proc. Natl. Acad. Sci. U. S. A. 2002, 99: 8742–8747
[29] Dutta C, Pan A. Horizontal gene transfer and bacterial diversity. J. Biosci. 2002, 27: 27-33
[30] Jain R, Rivera MC, Moore JE, Lake JA. Horizontal gene transfer in microbial genome
evolution. Theor. Popul. Biol. 2002, 61: 489-495
[31] http://wit.mcs.anl.gov/WIT2/
[32] http://biocyc.org/
[33] Brown JR. Ancient horizontal gene transfer. Nat. Rev. Genet. 2003, 4: 121-132
[34] http://www.ncgr.org/programs/pathways/
[35] http://www.genome.ad.jp/kegg/
[36] Kanehisa M and Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids
Res, 2000, 28(1): 27-30
[37] Albert R, Barabási A L. Statistical mechanics of complex networks. Reviews of modern
physics, 2002, 74: 47-97
[38] Barabási AL, Albert R. Emergence of scaling in random networks. Science, 1999, 286:
509-512
[39] Goto S, Nishioka T, Kanehisa M. LIGAND: chemical database for enzyme reactions.
Bioinformatics, 1998, 14: 591-599
[40] Bairoch A. The ENZYME data bank in 1995. Nucleic Acids Res., 1996, 24: 221–222
[41] Ravasz E, Somera AL, Mongru DA, et al. Hierarchical organization of modularity in
metabolic networks. Science, 2002, 297: 1551–1555
[42] Newman MEJ and Girvan M. Finding and evaluating community structure in networks.
Phys. Rev. E 69, 2004, 026113
[43] Guimerà R, Sales-Pardo M, and Amaral LAN. Modularity from fluctuations in random
graphs and complex networks. Phys. Rev. E 70, 2004, 025101(R)
[44] Schilling CH, Palsson BO. Assessment of the metabolic capabilities of Haemophilus
influenzae Rd through a genome-scale pathway analysis. J Theor Biol, 2000, 203: 249-283
[45] Redner S. An empirical study of the citation distribution. Eur. Phys. J. B, 1998, 4: 131-134
[46] Newman MEJ and Girvan M. Finding and evaluating community structure in networks.
Phys. Rev. E 69, 2004, 026113
[47] Kirkpatrick S, Gelatt CD and Vecchi MP. Optimization by simulated annealing. Science, 1983,
220: 671–680
[48] Glazko GV, Mushegian AR. Detection of evolutionarily stable fragments of cellular pathways
by hierarchical clustering of phyletic patterns. Genome Biol. 2004, 5: R32, 1-13
[49] Aspert N, Santa-Cruz D, Ebrahimi T. MESH: measuring errors between surfaces using the
Hausdorff distance. In Proceedings of the IEEE International Conference on Multimedia and
Expo (ICME), 2002, 705-708
[50] Margulis L, Fester R. Symbiosis as a Source of Evolutionary Innovation: Speciation
and Morphogenesis. MIT Press, 1991. ISBN 0262132699
Duarte, N.C., Herrgard, M.J., and Palsson, B.O. "Reconstruction and Validation of Saccharomyces
cerevisiae iND750, a Fully Compartmentalized Genome-scale Metabolic Model" Genome Research,
2004.
Appendix
Program corelate.pl, which builds the enzyme-enzyme association network, using mja as an example
my $count=0;
my @array=();
# Read the list of all entries; %array maps the line index to the entry
open(LIST,'keggdocentry_all.txt')||die "$!";
foreach(<LIST>)
{
    chomp;
    $array{$count}=$_;
    $count++;
}
close(LIST);
##############################################
$count=0;
#################################### these are entries of a certain species
open(LIST,'keggdocmjaentry.txt')||die "$!";
foreach(<LIST>)
{
    s/^\s+//;                      # strip leading whitespace
    chomp;
    my @array=split(/\s+/,$_);
    $mja_entry{$array[0]}=1;       # entries present in this species
    $mja{$count}=$array[0];
    $count++;
}
close(LIST);
######################################
$count=0;
open(LIST,'keggdocr1.txt')||die "$!";
foreach(<LIST>)
{
    chomp;
    my @array=split(/\s+/,$_);
    if(exists $mja_entry{$array{$count}}){
        #print $mja_entry{$array{$count}},"\n";
        $reac{$array{$count}}=$_; #### extract reactions only in this species
    }
    $count++;
open(OUT,'>keggdocmjacorelation.txt')||die "$!";
my $len=@array=sort keys(%mja_entry);
# Compare every pair of entries (i, j) of this species
for(my $i=0;$i<$len;$i++)
{
    for(my $j=$i;$j<$len;$j++)
    {
        my @tmparray1=split(/\s+/,$reac{$array[$i]});
        my @tmparray2=split(/\s+/,$proc{$array[$i]});
        my @tmparray3=split(/\s+/,$reac{$array[$j]});
        my @tmparray4=split(/\s+/,$proc{$array[$j]});
        # Link the enzymes of i and j if an item of $reac for i also occurs in $proc for j
        foreach my $arrayvalue1(@tmparray1)
        {
            $printtag=0;
            next if(!$arrayvalue1);
            foreach my $arrayvalue2(@tmparray4)
            {
                next if(!$arrayvalue2);
                if($arrayvalue1 eq $arrayvalue2)
                {
                    my @tmparray5=split(/\s+/,$enzyme{$array[$i]});
                    my @tmparray6=split(/\s+/,$enzyme{$array[$j]});
                    $printtag=1;
                    foreach my $arrayvalue3(@tmparray5)
                    {
                        foreach my $arrayvalue4(@tmparray6)
                        {
                            next if($arrayvalue4 eq $arrayvalue3);
                            print OUT "$arrayvalue4\t$arrayvalue3\n";
                        }
                    }
                    last;
                }
            }
            last if($printtag==1);
        }
        # Same test in the other direction: $reac for j against $proc for i
        foreach my $arrayvalue1(@tmparray3)
        {
            $printtag=0;
            next if(!$arrayvalue1);
            foreach my $arrayvalue2(@tmparray2)
            {
                next if(!$arrayvalue2);
                if($arrayvalue1 eq $arrayvalue2)
                {
                    my @tmparray5=split(/\s+/,$enzyme{$array[$j]});
                    my @tmparray6=split(/\s+/,$enzyme{$array[$i]});
                    $printtag=1;
                    foreach my $arrayvalue3(@tmparray5)
                    {
                        foreach my $arrayvalue4(@tmparray6)
                        {
                            next if($arrayvalue4 eq $arrayvalue3);
                            print OUT "$arrayvalue4\t$arrayvalue3\n";
                        }
                    }
                    last;
                }
            }
            last if($printtag==1);
        }
    }
}
close(OUT);
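Our cartographic representation provides a scale-specific method for analysing the information
contained in the structure of complex networks and for gaining insight into the function of the
network and of its components. An open question is how to apply existing module identification
algorithms to networks with a hierarchical structure.
For metabolic networks, an example that has already been studied in considerable depth, our
method allows us to recover firmly established biological facts and to uncover important new
results, such as the significant conservation of non-hub connector metabolites. Similar results
can be expected when the method is applied to other complex networks that are not as well
studied as metabolic networks. Among these, protein interaction networks and gene regulation
networks are probably the most significant examples.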
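For a given partition of the nodes of a network into modules, the modularity M of this
partition is:
\[ M \equiv \sum_{s=1}^{N} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right] \qquad (1) \]
where N is the number of modules, L is the number of links in the network, l_s is the number of
links between nodes in module s, and d_s is the sum of the degrees of the nodes in module s. The
rationale for this definition of modularity is the following.
A good partition of a network into modules must comprise as many within-module links and as few
between-module links as possible. However, if we simply try to minimize the number of
between-module links (or, equivalently, maximize the number of within-module links), the optimal
partition consists of a single module with no between-module links. Equation (1) addresses this
difficulty by imposing M = 0 when nodes are placed at random into modules or when all nodes are
in the same cluster.
The objective of a module identification algorithm is to find the partition with the largest
modularity. Several methods have been proposed to attain this goal. Most of them rely on
heuristic procedures and use M, or a similar measure, only to assess their performance. In
contrast, we use simulated annealing to find the partition with the largest modularity.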
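Simulated annealing for module identification
Simulated annealing is a stochastic optimization technique that makes it possible to find
low-cost configurations without getting trapped in high-cost local minima. This is achieved by
introducing a computational temperature T. When T is high, the system can explore
configurations of high cost, whereas at low T the system only explores low-cost regions. By
starting at high T and slowly decreasing it, the system descends gradually towards deep minima,
eventually overcoming small cost barriers.
When identifying modules, the objective is to maximize the modularity, and thus the cost is
C = -M, where M is as defined in equation (1). At each temperature, we perform a number of
random updates and accept them with probability:
\[ p = \begin{cases} 1 & \text{if } C_f \le C_i \\ \exp\left( -\dfrac{C_f - C_i}{T} \right) & \text{if } C_f > C_i \end{cases} \]
where C_f is the cost after the update and C_i is the cost before the update.
Specifically, at each temperature we propose n_i = fS^2 individual node movements from one
module to another, where S is the number of nodes in the network. We also propose n_c = fS
collective movements, which involve either merging two modules or splitting a module. For f we
typically choose f = 1. After the movements have been evaluated at a certain T, the system is
cooled down to T' = cT, with c = 0.995.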
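Within-module degree and participation coefficient
Each module can be organized in very different ways, ranging from totally centralized (with one
or a few nodes connected to all the others) to totally decentralized, with all nodes having
similar connectivities. Nodes with similar roles are expected to have similar within-module
connectivity. If \kappa_i is the number of links of node i to other nodes in its module s_i,
\bar{\kappa}_{s_i} is the average of \kappa over all the nodes in s_i, and \sigma_{\kappa_{s_i}}
is the standard deviation of \kappa in s_i, then
\[ z_i = \frac{\kappa_i - \bar{\kappa}_{s_i}}{\sigma_{\kappa_{s_i}}} \]
is the so-called z-score. The within-module degree z-score measures how well connected node i is
to the other nodes in the same module. Different roles can also arise because of the connections
of a node to modules other than its own. For example, two nodes with the same z-score play
different roles if one of them is connected to several nodes in other modules while the other is
not. We define the participation coefficient P_i of node i as:
\[ P_i = 1 - \sum_{s=1}^{N} \left( \frac{\kappa_{is}}{k_i} \right)^2 \]
where N is the number of modules, \kappa_{is} is the number of links of node i to nodes in
module s, and k_i is the total degree of node i. The participation coefficient of a node is
therefore close to 1 if its links are uniformly distributed among all the modules, and 0 if all
its links are within its own module.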
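Both quantities are easy to compute once a partition is known. The following Python sketch is an illustration under stated assumptions: the function name, the input format and the toy network are hypothetical, the population standard deviation is used, and z is set to 0 when all nodes of a module have the same within-module degree.

```python
import math
from collections import defaultdict


def z_and_participation(edges, module_of):
    """Return (z, P): within-module degree z-score and participation coefficient."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    # kappa[i]: number of links of node i to other nodes of its own module
    kappa = {i: sum(1 for j in neighbours[i] if module_of[j] == module_of[i])
             for i in module_of}
    per_module = defaultdict(list)
    for i, s in module_of.items():
        per_module[s].append(kappa[i])

    z, P = {}, {}
    for i, s in module_of.items():
        values = per_module[s]
        mean = sum(values) / len(values)
        # Assumption: population standard deviation; z = 0 if it vanishes.
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        z[i] = (kappa[i] - mean) / std if std > 0 else 0.0

        k_i = len(neighbours[i])          # total degree of node i
        links_to = defaultdict(int)       # links of node i into each module
        for j in neighbours[i]:
            links_to[module_of[j]] += 1
        P[i] = 1.0 - sum((c / k_i) ** 2 for c in links_to.values()) if k_i else 0.0
    return z, P


# Hypothetical network: a star (hub 0) in module "a", one of whose leaves
# is also linked to a two-node module "b".
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (5, 6)]
module_of = {0: "a", 1: "a", 2: "a", 3: "a", 4: "a", 5: "b", 6: "b"}
z, P = z_and_participation(edges, module_of)
print(z[0], z[1], P[0], P[4])
```

In this toy network the hub has z = 2 and P = 0, while node 4, whose two links are split between the two modules, has P = 0.5.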
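Loss rate
To quantify the relation between roles and conservation, we calculate the extent to which the
conservation of a metabolite across species is determined by the role it plays. Specifically,
for a pair of species A and B, we define the loss rate as the probability
p(R_A = 0 | R_B = R) ≡ p_lost(R) that a metabolite is not present in one of the species
(R_A = 0) given that it plays role R in the other species (R_B = R). Structurally relevant
roles are expected to have small p_lost(R), and vice versa.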
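As a concrete reading of this definition, the Python sketch below estimates p_lost(R) for one ordered pair of species from two role assignments. The function name, the dictionary representation (a metabolite missing from a dictionary is absent from that species) and the sample data are assumptions of the example.

```python
def loss_rate(roles_a, roles_b, role):
    """Estimate p(R_A = 0 | R_B = role) for one ordered pair of species.

    roles_a, roles_b: dicts mapping metabolite -> role label in species A and B.
    Returns None when no metabolite plays `role` in species B.
    """
    with_role = [m for m, r in roles_b.items() if r == role]
    if not with_role:
        return None
    lost = sum(1 for m in with_role if m not in roles_a)
    return lost / len(with_role)


# Hypothetical role assignments for two species.
roles_b = {"m1": "R1", "m2": "R1", "m3": "R6", "m4": "R1"}
roles_a = {"m1": "R2", "m3": "R6"}

print(loss_rate(roles_a, roles_b, "R1"))  # m2 and m4 are missing from A
print(loss_rate(roles_a, roles_b, "R6"))  # m3 is present in A
```

Two of the three R1 metabolites of species B are absent from species A, giving a loss rate of 2/3, whereas the single R6 metabolite is conserved.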
Functional cartography of complex metabolic networks
Roger Guimerà¹ and Luís A. Nunes Amaral¹
1. NICO and Department of Chemical and Biological Engineering,
Northwestern University, Evanston, Illinois 60208, USA
Correspondence and requests for materials should be addressed to L.A.N.A.
(Email: amaral@northwestern.edu).
Abstract
High-throughput techniques are leading to an explosive growth in the size
of biological databases and creating the opportunity to revolutionize our
understanding of life and disease. Interpretation of these data remains,
however, a major scientific challenge. Here, we propose a methodology that
enables us to extract and display information contained in complex
networks¹,²,³. Specifically, we demonstrate that we can find functional
modules⁴,⁵ in complex networks, and classify nodes into universal roles
according to their pattern of intra- and inter-module connections. The
method thus yields a 'cartographic representation' of complex networks.
Metabolic networks⁶,⁷,⁸ are among the most challenging biological networks
and, arguably, the ones with most potential for immediate applicability⁹.
We use our method to analyse the metabolic networks of twelve organisms
from three different superkingdoms. We find that, typically, 80% of the
nodes are only connected to other nodes within their respective modules,
and that nodes with different roles are affected by different evolutionary
constraints and pressures. Remarkably, we find that metabolites that
participate in only a few reactions but that connect different modules
are more conserved than hubs whose links are mostly within a single module.
If we are to extract the significant information from the topology of a
large, complex network, knowledge of the role of each node is of crucial
importance. A cartographic analogy is helpful to illustrate this point.
Consider the network formed by all cities and towns in a country (the nodes)
and all the roads that connect them (the links). It is clear that a map
in which each city and town is represented by a circle of fixed size and
each road is represented by a line of fixed width is hardly useful. Rather,
real maps emphasize capitals and important communication lines so that
we can obtain scale-specific information at a glance. Similarly, it is
difficult, if not impossible, to obtain information from a network with
hundreds or thousands of nodes and links, unless the information about
nodes and links is conveniently summarized. This is particularly true for
biological networks.
Here, we propose a methodology, which is based on the connectivity of the
nodes, that yields a cartographic representation of a complex network.
The first step in our method is to identify the functional modules⁴,⁵ in
the network. In the cartographic picture, modules are analogous to
countries or regions, and enable a coarse-grained, and thus simplified,
description of the network. Then we classify the nodes in the network into
a small number of system-independent 'universal roles'.
It is common that social networks have communities of highly
interconnected nodes that are less connected to nodes in other communities.
Such modular structures have been reported not only in social networks⁵,¹⁰,¹¹,¹²,
but also in food webs¹³ and biochemical networks⁴,¹⁴,¹⁵,¹⁶. It is widely
believed that the modular structure of complex networks plays a critical
role in their functionality⁴,¹⁴,¹⁶. There is therefore a clear need to
develop algorithms to identify modules accurately⁵,¹¹,¹⁷,¹⁸,¹⁹,²⁰.
We identify modules by maximizing the network's modularity¹¹,¹⁸,²¹ using
simulated annealing²² (see Methods). Simulated annealing enables us to
perform an exhaustive search and to minimize the problem of finding
sub-optimal partitions. It is noteworthy that, in our method, we do not
need to specify a priori the number of modules; rather, this number is
an outcome of the algorithm. Our algorithm is able to reliably identify
modules in a network whose nodes have as many as 50% of their connections
outside their own module (Fig. 1).
Figure 1: Performance of module identification methods.
To test the performance of the method, we build 'random networks' with
known module structure. Each test network comprises 128 nodes divided into
4 modules of 32 nodes. Each node is connected to the other nodes in its
module with probability p_i, and to nodes in other modules with probability
p_o < p_i. On average, thus, each node is connected to k_out = 96 p_o nodes
in other modules and to k_in = 31 p_i in the same module. Additionally,
p_i and p_o are selected so that the average degree of the nodes is k =
16. We display networks with: a, k_in = 15 and k_out = 1; b, k_in = 11 and
k_out = 5; and c, k_in = k_out = 8. d, The performance of a module identification
algorithm is typically defined as the fraction of correctly classified
nodes. We compare our algorithm to the Girvan–Newman algorithm⁵,¹⁸, which
is the reference algorithm for module identification¹¹,¹⁸,¹⁹. Note that our
method is 90% accurate even when half of a node's links are to nodes in
outside modules. e, Our module-identification algorithm is stochastic,
so different runs yield, in principle, different partitions. To test the
robustness of the algorithm, we obtain 100 partitions of the network
depicted in c and plot, for each pair of nodes in the network, the fraction
of times that they are classified in the same module. As shown in the figure,
most pairs of nodes are either always classified in the same module (red)
or never classified in the same module (dark blue), which indicates that
the solution is robust.
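The test networks described in this caption can be generated in a few lines. The Python sketch below is a minimal illustration; the function name, the seed handling and the edge-list output format are assumptions, while the sizes and the relations k_in = 31 p_i and k_out = 96 p_o follow the caption.

```python
import random


def benchmark_network(k_in, k_out, n_modules=4, module_size=32, seed=0):
    """Random network with a known module structure.

    Nodes in the same module are linked with probability p_i = k_in / 31 and
    nodes in different modules with probability p_o = k_out / 96 (for the
    default sizes), so the expected degree is k_in + k_out.
    """
    rng = random.Random(seed)
    n_nodes = n_modules * module_size
    p_i = k_in / (module_size - 1)
    p_o = k_out / (n_nodes - module_size)
    module_of = {v: v // module_size for v in range(n_nodes)}
    edges = []
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            p = p_i if module_of[u] == module_of[v] else p_o
            if rng.random() < p:
                edges.append((u, v))
    return edges, module_of


edges, module_of = benchmark_network(k_in=15, k_out=1)
average_degree = 2 * len(edges) / len(module_of)
inside = sum(1 for u, v in edges if module_of[u] == module_of[v])
print(len(module_of), round(average_degree, 2), inside / len(edges))
```

With k_in = 15 and k_out = 1 the average degree comes out close to 16 and the large majority of links fall inside modules, which corresponds to panel a of the figure.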
When considering modular networks, it is plausible to surmise that the
nodes in a network are connected according to the role they fulfil. This
fact has been long recognized in the analysis of social networks²³. For
example, in a classical hierarchical organization, the chief executive
is not directly connected to plant employees but is connected to the
members of the board of directors. Such a statement holds for virtually
any organization; that is, the role of chief executive is defined
irrespective of the particular organization considered.
We propose a new method to determine the role of a node in a complex network.
Our approach is based on the idea that nodes with the same role should
have similar topological properties²⁴ (see Supplementary Information for
a discussion on how our approach relates to previous work). We predict
that the role of a node can be determined, to a great extent, by its
within-module degree and its participation coefficient, which define how
the node is positioned in its own module and with respect to other modules²⁵,²⁶
(see Methods). These two properties are easily computed once the modules
of a network are known.
The within-module degree z_i measures how 'well-connected' node i is to
other nodes in the module. High values of z_i indicate high within-module
degrees and vice versa. The participation coefficient P_i measures how
'well-distributed' the links of node i are among different modules. The
participation coefficient P_i is close to 1 if its links are uniformly
distributed among all the modules, and 0 if all its links are within its
own module.
We define heuristically seven different universal roles, each defined by
a different region in the z–P parameter space (Fig. 2). According to the
within-module degree, we classify nodes with z ≥ 2.5 as module hubs and
nodes with z < 2.5 as non-hubs. Both hub and non-hub nodes are then more
finely characterized by using the values of the participation coefficient
(see Supplementary Information for a detailed justification of this
classification scheme, and for a discussion on possible alternatives).
Figure 2: Roles and regions in the z–P parameter space.
a, Each node in a network can be characterized by its within-module degree
and its participation coefficient (see Methods for definitions). We
classify nodes with z ≥ 2.5 as module hubs and nodes with z < 2.5 as non-hubs.
We find that non-hub nodes can be naturally assigned into four different
roles: (R1) ultra-peripheral nodes; (R2) peripheral nodes; (R3) non-hub
connector nodes; and (R4) non-hub kinless nodes. We find that hub nodes
can be naturally assigned into three different roles: (R5) provincial hubs;
(R6) connector hubs; and (R7) kinless hubs (see text and Supplementary
Information for details). b, Metabolite role determination for the
metabolic network of E. coli, as obtained from the MZ database. Each
metabolite is represented as a point in the z–P parameter space, and is
coloured according to its role. c, Same as b but for the complete KEGG
database.
We find that non-hub nodes can be naturally divided into four different
roles: (R1) ultra-peripheral nodes; that is, nodes with all their links
within their module (P ≤ 0.05); (R2) peripheral nodes; that is, nodes with
most links within their module (0.05 < P ≤ 0.62); (R3) non-hub connector
nodes; that is, nodes with many links to other modules (0.62 < P ≤ 0.80);
and (R4) non-hub kinless nodes; that is, nodes with links homogeneously
distributed among all modules (P > 0.80). We find that hub nodes can be
naturally divided into three different roles: (R5) provincial hubs; that
is, hub nodes with the vast majority of links within their module (P ≤
0.30); (R6) connector hubs; that is, hubs with many links to most of the
other modules (0.30 < P ≤ 0.75); and (R7) kinless hubs; that is, hubs with
links homogeneously distributed among all modules (P > 0.75).
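The seven regions of the z–P plane translate directly into a lookup function. The Python sketch below encodes the thresholds quoted in the text, reading the hub boundary as z ≥ 2.5 and the upper limits of the P intervals as inclusive; the function name and the role labels are choices made for this example.

```python
def assign_role(z, p):
    """Map a node's within-module degree z and participation coefficient p to a role."""
    if z < 2.5:                      # non-hub nodes
        if p <= 0.05:
            return "R1 ultra-peripheral"
        if p <= 0.62:
            return "R2 peripheral"
        if p <= 0.80:
            return "R3 non-hub connector"
        return "R4 non-hub kinless"
    if p <= 0.30:                    # hub nodes (z >= 2.5)
        return "R5 provincial hub"
    if p <= 0.75:
        return "R6 connector hub"
    return "R7 kinless hub"


for z, p in [(0.3, 0.0), (1.0, 0.5), (0.2, 0.7), (3.1, 0.1), (4.0, 0.6)]:
    print(z, p, assign_role(z, p))
```

A node with z = 3.1 and P = 0.1, for instance, is classified as a provincial hub, while a node with z = 0.2 and P = 0.7 is a non-hub connector.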
To test the applicability of our approach to complex biological networks,
we consider the cartographic representation of the metabolic networks⁶,⁷,⁸,⁹,¹⁴
of twelve organisms: four bacteria (Escherichia coli, Bacillus
subtilis, Lactococcus lactis and Thermosynechococcus elongatus), four
eukaryotes (Saccharomyces cerevisiae, Caenorhabditis elegans,
Plasmodium falciparum and Homo sapiens), and four archaea (Pyrococcus
furiosus, Aeropyrum pernix, Archaeoglobus fulgidus and Sulfolobus
solfataricus). In metabolic networks, nodes represent metabolites and two
nodes i and j are connected by a link if there is a chemical reaction in
which i is a substrate and j a product, or vice versa. In our analysis,
we use the database developed by Ma and Zeng⁸ (MZ) from the Kyoto
Encyclopedia of Genes and Genomes²⁷ (KEGG). The results we report are not
altered if we consider the complete KEGG database instead (Figs 2c and
4b, and Supplementary Information).
Figure 3: Cartographic representation of the metabolic network of E. coli.
Each circle represents a module and is coloured according to the KEGG
pathway classification of the metabolites it contains. Certain important
nodes are depicted as triangles (non-hub connectors), hexagons (connector
hubs) and squares (provincial hubs). Interactions between modules and
nodes are depicted using lines, with thickness proportional to the number
of actual links. Inset: metabolic network of E. coli, which contains 473
metabolites and 574 links. This representation was obtained using the
program Pajek. Each node is coloured according to the 'main' colour of
its module, as obtained from the cartographic representation.
Figure 4: Roles of metabolites and inter-species conservation.