SlideShare a Scribd company logo
メモリより⼤大きいデータの 
Suffix Tree 構築⽅方法の紹介 
2014-08-29 
星野 喬 
1
概要 
• 主に以下の論論⽂文の紹介 
• Suffix trees for inputs larger than main 
memory 
– Marina Barsky, Ulrike Stege and Alex Thomo 
– Information Systems archive Volume 36 Issue 
3, May, 2011 
– CIKM2009 での short paper の発展版 
2
論論⽂文の主張 
• メモリより⼤大きいデータの suffix tree の 
構築⽅方法の提案 
• 既存の構築⽅方法と違い,ディスクアクセ 
スはほぼシーケンシャルアクセスのみ 
3
Suffix tree とは (復復習) 
• ⾼高速な⽂文字列列操 
作のための suffix 
を⽤用いた⽊木構造 
– 葉葉ノード: 
suffix を表す 
– 節ノード: 
suffixes 間の共通 
prefix を表す 
4 
most N −1 internal nodes in the tree. Hence, the maximum number 
linear in N. The tree’s total space is linear in N in the case that each 
stored in a constant space. Fortunately, this is the case for an implicit 
substrings by their positions. 
R 
a b a b c 
0 1 2 3 4 
0 
ab 
abc 
1-1 b 
4-4 
abc 
1 
c 
4 
3 
c 
c 
2 
0-1 
2-4 4-4 2-4 4-4 
= ababc. For clarity, the explicit edge labels are shown, which are represented as 
in the actual suffix tree. Each suffix Si can be found by concatenating substrings
提案アルゴリズム: B2ST 
• Big tree, Big string Suffix Tree construction 
algorithm 
• 3 つのフェーズ 
– (1) Input partitioning 
– (2) Pair-wise sorting 
– (3) Merging 
5
変数の意味 
• X: ⼊入⼒力力⽂文字列列 (⻑⾧長さN) 
• Xij: X[i]からX[j-1]までの部分⽂文字列列 
• Si: i 番⽬目の suffix (0 = i  N) 
• LCPij: Si と Sj の longest common prefix 
• |LCPij|: LCPij の⻑⾧長さ 
• SA: suffix array 
• ST: suffix tree 
6
partition, and its positions are not included into the suffix array of the partition, 
positions are not included into the suffix array of the partition. This ensures suffix Input partitioning 
Si of a partition Xu (0 ≤ i  |Xu|, 0 ≤ u  k) will be sorted as representative of a suffix Sup+i of X. In practice, for real-life DNA sequences, the tail is negligibly small compared to the size of the partition itself (it never exceeded 
• X を k 個の Xu に分割 (0 = u  k) 
• k = 2N/M (M: メモリサイズ) 
• 各 Xu は tail を持つ 
– Xu 内での suffix sorting 時の区別のため 
– tail と同じ部分⽂文字列列が含まれないように 
– 遺伝⼦子だと⾼高々 1000 くらい 
characters in our experiments with DNA databases). 
Consider the example in Figure 2. It shows the partitioning of input string ababaaabbabbbabaabab into four partitions. The main memory can hold up to 16 the input at a time. To facilitate the example, we represent our input as stands for 0-bit, b stands for 1-bit). In this illustration we also refer to the as A, B, C, and D in order to distinguish them from the numbers representing 
positions. Note that the tail of partition XB is substring bbb, which never = aabba. 
Partition A Partition B Partition C Partition D 
a b a b a a a b b a b b b a b a a b a b 
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 
Input string X = ababaaabbabbbabaabab and its four partitions. The tail of partition B 7is
Pair-wise sorting 
• 全ての XuXv について SA を LCP 付きで構 
築 (SAuv) 
– XuXv: Xu と Xv を結合したもの (u  v) 
– オンメモリで処理理できるように k を調整済 
– SA 内で隣隣り合う suffix 同⼠士の LCP も保存 
• 出⼒力力 
– SAu: Xu の SA 
– Ruv: SAuv 内の LCP と uv を表す bit の列列 
8
In this step we generate suffix arrays for each pair of partitions. The pseudocode for this 
step is shown in Figure 3. We concatenate every possible pair u, v of partitions with their 
tails (0 ≤ u  k − 1, u + SAu 1 ≤ v  k, main memory and build the suffix array と u  v) into Ruv string XuXv. のWe 例例load  
this input into the 
SAuv with attached LCP length information for 
each suffix. The suffixes which start in tail positions are excluded from the output suffix 
array, they serve only for determining the relative order of suffixes starting at the end of 
each partition. An LCP entry of SAuv is the length of the longest common prefix of each 
suffix in SAuv with its immediate predecessor. Figure • SAu 4 shows size: such an 4N/array k for + the α 
pair 
A, B of partitions for the same input string as in Figure 2. From each SAuv, we extract 
– N/k  2^32 が前提 
• Ruv size: 8N/k + α 
• SAu は k 個 
• Ruv は k(k-1)/2 個 
• 計 4kN + α: O(kN) 
9 
for determining the relative order of suffixes starting at the end of 
LCP entry of SAuv is the length of the longest common prefix of each 
immediate predecessor. Figure 4 shows such an array for the pair 
the same input string as in Figure 2. From each SAuv, we extract 
SAAB (in memory) 
suffix start 
LCP 
partition bit 
5 1 3 1 2 5 4 2 4 3 
0 2 1 3 2 3 0 2 3 1 
A B A A B B A A B B 
RAB 
SAAB (in memory) 
suffix start 
LCP 
partition bit 
5 1 3 1 2 5 4 2 4 3 
0 2 1 3 2 3 0 2 3 1 
A B A A B B A A B B 
LCP 0 2 1 3 2 3 0 2 3 1 
partition bit A B A A B B A A B B 
SAA 
5 3 1 4 2 
RAB 
LCP 0 2 1 3 2 3 0 2 3 1 
partition bit A B A A B B A A B B 
SAA 
written 5 3 to 1 disk 
4 2 
written to disk 
with LCP information for a pair of partitions A and B. Two structures are 
the suffix array of partition A and (2) the order array RAB storing the relative 
B. These two structures are written to disk. 
Fig. 4. Suffix array SAAB with LCP information for a pair of partitions A and B. Two structures are 
extracted from SAAB: (1) the suffix array of partition A and (2) the order array RAB storing the relative 
order of suffixes in A and B. These two structures are written to disk.
Pair-wise sorting pseudo code 
10 
Algorithm pairwiseSorting 
input: k partitions of string X 
1. for Algorithm (u=0; upairwiseSorting 
k-1; i++) 
2. for input: (v=1; k partitions vk; v++) 
of string X 
3. 1. concatenate for (u=0; uk-1; 2. for (v=1; vk; XuXv i++) 
v++) 
and load into RAM 
4. 3. build concatenate suffix array XuXv with and load LCP into SAuv 
RAM 
5. 4. during build sequential suffix array scan with LCP of SAuv 
SAuv 
6. 5. if during v==k-sequential 1 //last scan chunk 
of SAuv 
6. output if v==k-1 to //last chunk 
7. 7. output to disk disk SAv 
SAv 
8. 8. output output to to disk disk SAu 
SAu 
9. 9. output output to to disk disk Ruv Ruv //order //order array 
array 
Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs. 
Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs. 
Step 2: Suffix sorting in partition pairs
Merging 
• 2 つの部品で構成 
– SAu と Ruv のストリームから SA のストリー 
ムを作る 
– SA のストリームから ST を作る 
• オンメモリデータ 
– SA_buf_u: SAu ⼊入⼒力力バッファ(size m, k 個) 
– R_buf_uv: Ruv ⼊入⼒力力バッファ(size 2m/k, 
k(k-1)/2 個) 
– ST_buf: ST 出⼒力力バッファ (多分 size M-km) 
11
SA ストリーム作成 
12 
SAu 
... 
Ruv 
... 
HeapSA with LCP 
Heap 構造を使って SAu (0 = u  k) 
の中で⼀一番⼩小さい suffix を持つ u を 
選んで出⼒力力
5. Merge algorithm 
13 
What happens with each suffix in the output buffer is the subject pseudocode of merge is shown in Figure 8. 
Algorithm merge 
1. lastTransferred = null 
2. while heap is not empty 
3. remove smallest suffix Si of partition u 
from the top of the heap 
4. rebalance heap 
5. lcp = 0 
6. if lastTransferred is not null 
7. v = lastTransferred.partitionID 
8. lcp = LCP from R_bufuv [current_pointer] 
9. create leaf for Si using lcp in ST_buf 
10. advancePointers (u) 
11. lastTransferred=Si 
12. if ST_buf is full 
13. store Si (max suffix) 
as a pointer to the current tree 
14. write ST_buf to disk 
4. rebalance heap 
5. lcp = 0 
6. if lastTransferred is not null 
7. v = lastTransferred.partitionID 
8. lcp = LCP from R_bufuv [current_pointer] 
9. create leaf for Si using lcp in ST_buf 
10. advancePointers (u) 
11. lastTransferred=Si 
12. if ST_buf is full 
13. store Si (max suffix) 
as a pointer to the current tree 
14. write ST_buf to disk 
15. lastTransferred = null 
16. Sj = get next suffix from SA_bufu 
17. if Sj is not null 
18. insert Sj into heap 
Fig. 8. The general pseudocode for merge. 
lcp = 0 
6. if lastTransferred is not null 
7. v = lastTransferred.partitionID 
8. lcp = LCP from R_bufuv [current_pointer] 
9. create leaf for Si using lcp in ST_buf 
10. advancePointers (u) 
11. lastTransferred=Si 
12. if ST_buf is full 
13. store Si (max suffix) 
as a pointer to the current tree 
14. write ST_buf to disk 
15. lastTransferred = null 
16. Sj = get next suffix from SA_bufu 
17. if Sj is not null 
18. insert Sj into heap 
Fig. 8. The general pseudocode for merge.
the information we need to efficiently perform the merge. As produce information. Since each Ruv contains an information only about two partitions, we need to the use suffix one bit to tree represent for the the partition entire ID input in Ruv. string Specifically, X. we We use are 0 for doing u entire v (input u  v). Figure string InitialMerge 4 shows into SAA main and memory. RAB extracted algorithmIn from fact, SAAB we for never At the end of this step we have on disk suffix arrays for partitions  
partition access pair (k k (of total size plus k(k − 1)/2 order arrays for each possible pair of partitions (of total size kN). 
Algorithm initializeMerge 
This is all the information we need to efficiently perform the merge. As a result of merge we produce the suffix tree for the entire input string X. We are doing this without 
loading the entire input string into main memory. In fact, we never access X anymore. 
14 
1. for each SA_bufu 
2. read first m start positions 
from disk suffix array SAu 
Algorithm initializeMerge 
1. for each SA_bufu 
2. read first m start positions 
3. for each R_bufuv 
4. read first m/k LCP+partitionBit from Ruv 
from disk suffix array SAu 
3. for each R_bufuv 
4. read first m/k LCP+partitionBit from Ruv 
5. for each SA_bufu 
6. insert SA_bufu[0] into heap 
5. for each SA_bufu 
6. insert SA_bufu[0] into heap 
Fig. 5. The pseudocode for buffer allocation as the initial step for merge. 
Fig. 5. The pseudocode for buffer allocation as the initial step for merge.
CompareSuffix algorithm 
15 
Algorithm compareSuffix (Si from partition u, 
Sj from partition v) 
1. if (u = = v) 
2. return -1 //Si lex Sj, since they are sorted 
//in increasing order inside each partition 
Algorithm compareSuffix (Si from partition u, 
3. if (u  v) 
4. if (partitionBit in R_bufuv[Sj from partition v) 
1. if (u = = v) 
current pointer] = = 0) 
5. return 2. return -1 -1 //Si lex Sj, //Si since they are sorted 
lex Sj 
6. else 
//in increasing order inside each partition 
7. return1 //Si lex Sj 
8. if (u  v) 
9. if (partitionBit in R_bufvu[current pointer] = = 0) 
10. return 1 //Sj lex Si 
11. else 
12. return -1 //Sj lex Si 
3. if (u  v) 
4. if (partitionBit in R_bufuv[current pointer] = = 0) 
5. return -1 //Si lex Sj 
6. else 
7. return1 //Si lex Sj 
8. if (u  v) 
9. if (partitionBit in R_bufvu[current pointer] = = 0) 
10. return 1 //Sj lex Si 
11. else 
12. return -1 //Sj lex Si 
Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays 
created during pairwise suffix sorting. 
Algorithm for suffix comparison which uses the pairwise suffix information from the order during pairwise suffix sorting.
in all relevant order buffers. We insert into the heap the we continue in a similar way until all suffixes are merged. 
Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays 
AdvancePointers algorithm 
Pseudocode of current pointer advancing in suffix array buffer and the order 16 
created during pairwise suffix sorting. 
Algorithm advancePointers (partition ID u) 
1. SA_bufu.current_pointer++ 
2. if reached the end of SA_bufu 
3. refill SA_bufu from disk-based SAu 
4. for (i=0; iu; i++) 
5. R_bufiu.current_pointer++ 
6. if reached the end of R_bufiu 
7. refill R_bufiu from disk-based Riu 
8. for (i=u; ik; i++) 
9. R_bufui.current_pointer++ 
10. if reached the end of R_bufui 
11. refill R_bufui from disk-based Rui 
pointers for all the order array buffers containing information about partition u are also 
advanced by 1, as shown in pseudocode of Figure 7. This means we have determined the 
order of the current suffix of partition u, and we need to consider the next element both 
in SA BUFu and in all relevant order buffers. We insert into the heap the next suffix of 
partition u, and we continue in a similar way until all suffixes are merged. 
Algorithm advancePointers (partition ID u) 
1. SA_bufu.current_pointer++ 
2. if reached the end of SA_bufu 
3. refill SA_bufu from disk-based SAu 
4. for (i=0; iu; i++) 
5. R_bufiu.current_pointer++ 
6. if reached the end of R_bufiu 
7. refill R_bufiu from disk-based Riu 
8. for (i=u; ik; i++) 
9. R_bufui.current_pointer++ 
10. if reached the end of R_bufui 
11. refill R_bufui from disk-based Rui 
Fig. 7. Pseudocode of current pointer advancing in suffix array buffer and the order buffers.
ST 作成例例 
• X: ababc 
• SA: 02134 
• binary alphabet: a: 00, b: 01, c: 10 
• 本⼿手法では ST は binary alphabet で表現 
– 必ず 2 分⽊木になるので扱いやすい 
17
ST with binary alphabet 
18 
ST with alphabet 
ST with binary alphabet 
S4 
S4 
S0S2S1S3 
S0S2S1S3 
abbc 
abccabcc 
010 
0011 
0001101000011010 
index や LCP が bit 単位か 
byte (⽂文字) 単位かの違い
⽊木のどこに挿⼊入するか 
• SA の順で挿⼊入するので,必ず右端のパス 
内のどこかに internal node を挿⼊入 
• LCP を⾒見見てどこに挿⼊入するか決定 
19 
L1 
L1 
L1 
inserted 
internal node 
L1 
L1inserted 
leaf node 
L1
Insert s0 then s2 
20 
X: ababc 
Xb: 0001000110 
SA: 02134 
Insert s0 (0001000110, I0, L0)Insert s2 (000110, I4, L4) 
S0 
S0 
I0 
L4 
S2 
I0+4 I4+4 
I0 
IX: index is X (in unit of bit) 
LX: LCP is X (in unit of bit)
Insert s1 
21 
X: ababc 
Xb: 0001000110 
SA: 02134 
Insert s2 (000110, I4, L4)Insert s1 (01000110, I2, L1) 
S0 
I0 
L4 
S2 
I4 I8 
Ix: index is x (in unit of bit) 
Lx: LCP is x (in unit of bit) 
S0 
I0+1 
L4-1 
S2 
I4 I8 
I0 
L1 
S1 
I2+1
Insert s3 
22 
X: ababc 
Xb: 0001000110 
SA: 02134 
Insert s1 (01000110, I2, L1)Insert s3 (0110, I6, L2) 
Ix: index is x (in unit of bit) 
Lx: LCP is x (in unit of bit) 
S0 
I1 
L3 
S2 
I4 I8 
I0 
L1 
S1 
I3+1 
S0 
I1 
L3 
S2 
I4 I8 
I0 
L1 
S1 
I3 
I3 
L2-1 
S3 
I6+2
Insert s4 
23 
X: ababc 
Xb: 0001000110 
SA: 02134 
Insert s3 (0110, I6, L2)Insert s4 (10, I8, L0) 
Ix: index is x (in unit of bit) 
Lx: LCP is x (in unit of bit) 
S0 
I1 
L3 
S2 
I4 I8 
I0 
L1 
S1 
I4 
I3 
L1 
S3 
I8 
S0 
I1 
L3 
S2 
I4 I8 
I0+0 
L1-0 
S1 
I4 
I3 
L1 
S3 
I8 
I0 
L0 
S4 
I8+0
Index と LCP の意味 
24 
S0 
I1 
L3 
S2 
0 10 
I0 
L1 
I0+1+3 I4+1+3 
I3 
L1 
S1 
I2+1+1 
S3 
I6+1+1 
I0 
L0 
S4 
I8 
001 1 
000110 10 000110 
10 
Xb: 0001000110 
idx: 0123456789
ST 出⼒力力フォーマット 
• 3 つのデータ 
– Internal node ⽤用の array 
– Leaf node ⽤用の array 
– 最⼤大 suffix の prefix (実体): divider 
• divider を使って検索索すべきファイルを特定 
25 
Internal node structure (順番は不不明)Leaf node structure 
Left childRight childIndex 
4bytes4bytes4bytes 
partition id (1byte) 
Index 
4bytes 
partition id (1byte)
他⼿手法との⽐比較 
• TDD and ST-merge (2005年年) 
– 6MB メモリ 20MB データで 8 時間 
• Trellis+SB (2008年年) 
– 3GB データ 512MB メモリ で 11 時間 
– ランダムアクセス 95% 減 
• DiGeST (2008年年) 
– ランダムアクセス 98% 減 
• どれもこれも分割して処理理後 merge する 
• ディスクのランダムアクセスをいかに減らすか 
がポイント 26
TDD which works with uncompressed inputs. The results 評価: 他⼿手法との⽐比較 
Program Time, hours 
TDD 125 
B2ST 3 
• ⼊入⼒力力 3GB の DNA (1⽂文字1byte) 
• 使⽤用メモリ 600MB 
27 
Trellis+SB 11 
of different suffix tree construction algorithms for approximately which is larger than the total allocated main memory. 
suffix tree for the above 3GB input in 125 hours. of Trellis+SB reported in [20]. The value in Figure similar settings on a comparable machine. 
divided the 3GB into partitions of 1GB each and built
produced an intermediate on-disk output of size 234GB. Despite this, the merge in only 59 minutes, scanning all this on-disk data in sequential manner 2514 suffix tree files of total 215GB. 
評価: ⼊入⼒力力サイズを変えて⽐比較 
Total 
time, 
min 
Merge, 
min 
Pairwise 
sorting, 
min 
Number of 
partition 
pairs 
Number 
of 
partitions 
Input 
size, 
GB 
6 3 3 441 27 468 
8 4 6 730 34 764 
10 5 10 1100 42 1142 
• ⼊入⼒力力データは DNA を複数結合したもの 
• メモリは 2GB 使⽤用 
• Pairwise sorting は k^2 で効いてくるが並列列化で 
きそう 
• Merge はほぼリニア 
28 
12 6 15 1462 59 1521 
Running time (min) of B2ST for different sets (approximately 6, 8, 10 and 12GB) of genomic 2GB of main memory. 
example shows that we need a large temporary disk space for scaling up the Specifically, we need D = k2p = kN disk space to store the order arrays partition pairs. Since the number of partitions is k = N/M, from D = N2/M the size of the largest input that we can process with M bytes of internal bytes of disk space. If we substitute the common values for modern computers,
所感 
• もしかしてメモリのランダムアクセスよりも 
ディスクのシーケンシャルアクセスの⽅方がス 
ループット⼤大きかったりする??? 
• k を定数と⾒見見做せるケースでしか使えない 
– 実質的に partition id に 1 byte, suffix index に 4 byte 
の計 5 bytes (N  2^40) 程度度のデータが対象 
– Induce sort でやっているように再帰的に⽐比較して 
O(kN) à O(N) にできないものか 
• tail の件といい dividers の件といい,LCP が巨⼤大 
な場合をあまり考慮していない 
– 遺伝⼦子が対象ならそれで良良いが... 
29

More Related Content

Viewers also liked

10分で分かるデータストレージ
10分で分かるデータストレージ10分で分かるデータストレージ
10分で分かるデータストレージTakashi Hoshino
 
トランザクションの並行実行制御 rev.2
トランザクションの並行実行制御 rev.2トランザクションの並行実行制御 rev.2
トランザクションの並行実行制御 rev.2
Takashi Hoshino
 
Effective Modern C++ 勉強会#6 Item25
Effective Modern C++ 勉強会#6 Item25Effective Modern C++ 勉強会#6 Item25
Effective Modern C++ 勉強会#6 Item25
Takashi Hoshino
 
Effective Modern C++ 勉強会#8 Item38
Effective Modern C++ 勉強会#8 Item38Effective Modern C++ 勉強会#8 Item38
Effective Modern C++ 勉強会#8 Item38
Takashi Hoshino
 
トランザクションの並行処理制御
トランザクションの並行処理制御トランザクションの並行処理制御
トランザクションの並行処理制御
Takashi Hoshino
 
MySQLアーキテクチャ図解講座
MySQLアーキテクチャ図解講座MySQLアーキテクチャ図解講座
MySQLアーキテクチャ図解講座
Mikiya Okuno
 
トランザクションの設計と進化
トランザクションの設計と進化トランザクションの設計と進化
トランザクションの設計と進化
Kumazaki Hiroki
 
トランザクションをSerializableにする4つの方法
トランザクションをSerializableにする4つの方法トランザクションをSerializableにする4つの方法
トランザクションをSerializableにする4つの方法
Kumazaki Hiroki
 

Viewers also liked (8)

10分で分かるデータストレージ
10分で分かるデータストレージ10分で分かるデータストレージ
10分で分かるデータストレージ
 
トランザクションの並行実行制御 rev.2
トランザクションの並行実行制御 rev.2トランザクションの並行実行制御 rev.2
トランザクションの並行実行制御 rev.2
 
Effective Modern C++ 勉強会#6 Item25
Effective Modern C++ 勉強会#6 Item25Effective Modern C++ 勉強会#6 Item25
Effective Modern C++ 勉強会#6 Item25
 
Effective Modern C++ 勉強会#8 Item38
Effective Modern C++ 勉強会#8 Item38Effective Modern C++ 勉強会#8 Item38
Effective Modern C++ 勉強会#8 Item38
 
トランザクションの並行処理制御
トランザクションの並行処理制御トランザクションの並行処理制御
トランザクションの並行処理制御
 
MySQLアーキテクチャ図解講座
MySQLアーキテクチャ図解講座MySQLアーキテクチャ図解講座
MySQLアーキテクチャ図解講座
 
トランザクションの設計と進化
トランザクションの設計と進化トランザクションの設計と進化
トランザクションの設計と進化
 
トランザクションをSerializableにする4つの方法
トランザクションをSerializableにする4つの方法トランザクションをSerializableにする4つの方法
トランザクションをSerializableにする4つの方法
 

Similar to メモリより大きなデータの Sufix Array 構築方法の紹介

32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefix
Sanjeev Gupta
 
Solving ProportionsRead the following instructions in order to.docx
Solving ProportionsRead the following instructions in order to.docxSolving ProportionsRead the following instructions in order to.docx
Solving ProportionsRead the following instructions in order to.docx
rafbolet0
 
A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithms
unyil96
 
Range reader/writer locking for the Linux kernel
Range reader/writer locking for the Linux kernelRange reader/writer locking for the Linux kernel
Range reader/writer locking for the Linux kernel
Davidlohr Bueso
 
FractalTreeIndex
FractalTreeIndexFractalTreeIndex
FractalTreeIndex
Akhil M Sreenath
 
presentation
presentationpresentation
presentation
Mikhail Dubov
 
Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithm...
Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithm...Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithm...
Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithm...
AIST
 
あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法
x1 ichi
 
Mining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential PatternsMining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential Patterns
IOSR Journals
 
Graph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate designGraph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate design
Ieee Xpert
 
Boolean Algebra
Boolean AlgebraBoolean Algebra
Boolean Algebra
SwathiSundari
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern mining
Ashis Kumar Chanda
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern mining
Ashis Chanda
 
Compiler unit 5
Compiler  unit 5Compiler  unit 5
Compiler unit 5
BBDITM LUCKNOW
 
Galgo f
Galgo fGalgo f
Socket++
Socket++Socket++
Socket++
berenguer1979
 
DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...
DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...
DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...
eeiej_journal
 
Sp 3828 10.1007-2_f978-3-642-31265-6_20
Sp 3828 10.1007-2_f978-3-642-31265-6_20Sp 3828 10.1007-2_f978-3-642-31265-6_20
Sp 3828 10.1007-2_f978-3-642-31265-6_20
Ali Akbar Dadjouyan
 
Gsm attacks
Gsm attacksGsm attacks
Gsm attacks
maicuong8
 
Exploitation Crash Course
Exploitation Crash CourseExploitation Crash Course
Exploitation Crash Course
UTD Computer Security Group
 

Similar to メモリより大きなデータの Sufix Array 構築方法の紹介 (20)

32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefix
 
Solving ProportionsRead the following instructions in order to.docx
Solving ProportionsRead the following instructions in order to.docxSolving ProportionsRead the following instructions in order to.docx
Solving ProportionsRead the following instructions in order to.docx
 
A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithms
 
Range reader/writer locking for the Linux kernel
Range reader/writer locking for the Linux kernelRange reader/writer locking for the Linux kernel
Range reader/writer locking for the Linux kernel
 
FractalTreeIndex
FractalTreeIndexFractalTreeIndex
FractalTreeIndex
 
presentation
presentationpresentation
presentation
 
Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithm...
Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithm...Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithm...
Mikhail Dubov - Text Analysis with Enhanced Annotated Suffix Trees: Algorithm...
 
あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法
 
Mining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential PatternsMining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential Patterns
 
Graph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate designGraph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate design
 
Boolean Algebra
Boolean AlgebraBoolean Algebra
Boolean Algebra
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern mining
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern mining
 
Compiler unit 5
Compiler  unit 5Compiler  unit 5
Compiler unit 5
 
Galgo f
Galgo fGalgo f
Galgo f
 
Socket++
Socket++Socket++
Socket++
 
DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...
DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...
DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...
 
Sp 3828 10.1007-2_f978-3-642-31265-6_20
Sp 3828 10.1007-2_f978-3-642-31265-6_20Sp 3828 10.1007-2_f978-3-642-31265-6_20
Sp 3828 10.1007-2_f978-3-642-31265-6_20
 
Gsm attacks
Gsm attacksGsm attacks
Gsm attacks
 
Exploitation Crash Course
Exploitation Crash CourseExploitation Crash Course
Exploitation Crash Course
 

More from Takashi Hoshino

Serializabilityとは何か
Serializabilityとは何かSerializabilityとは何か
Serializabilityとは何か
Takashi Hoshino
 
Isolation Level について
Isolation Level についてIsolation Level について
Isolation Level について
Takashi Hoshino
 
データベースシステムにおける直列化可能性と等価な時刻割り当てルールの提案 rev.3
データベースシステムにおける直列化可能性と等価な時刻割り当てルールの提案 rev.3データベースシステムにおける直列化可能性と等価な時刻割り当てルールの提案 rev.3
データベースシステムにおける直列化可能性と等価な時刻割り当てルールの提案 rev.3
Takashi Hoshino
 
WalB Driver Internals
WalB Driver InternalsWalB Driver Internals
WalB Driver Internals
Takashi Hoshino
 
Effective Modern C++ 勉強会#1 Item3,4
Effective Modern C++ 勉強会#1 Item3,4Effective Modern C++ 勉強会#1 Item3,4
Effective Modern C++ 勉強会#1 Item3,4
Takashi Hoshino
 
WALをバックアップとレプリケーションに使う方法
WALをバックアップとレプリケーションに使う方法WALをバックアップとレプリケーションに使う方法
WALをバックアップとレプリケーションに使う方法
Takashi Hoshino
 
10分で分かるバックアップとレプリケーション
10分で分かるバックアップとレプリケーション10分で分かるバックアップとレプリケーション
10分で分かるバックアップとレプリケーションTakashi Hoshino
 
10分で分かるLinuxブロックレイヤ
10分で分かるLinuxブロックレイヤ10分で分かるLinuxブロックレイヤ
10分で分かるLinuxブロックレイヤTakashi Hoshino
 
Intel TSX 触ってみた 追加実験 (TTAS)
Intel TSX 触ってみた 追加実験 (TTAS)Intel TSX 触ってみた 追加実験 (TTAS)
Intel TSX 触ってみた 追加実験 (TTAS)
Takashi Hoshino
 
Intel TSX HLE を触ってみた x86opti
Intel TSX HLE を触ってみた x86optiIntel TSX HLE を触ってみた x86opti
Intel TSX HLE を触ってみた x86opti
Takashi Hoshino
 
Suffix Array 構築方法の紹介
Suffix Array 構築方法の紹介Suffix Array 構築方法の紹介
Suffix Array 構築方法の紹介
Takashi Hoshino
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
Takashi Hoshino
 
ログ先行書き込みを用いたストレージ差分取得の一手法
ログ先行書き込みを用いたストレージ差分取得の一手法ログ先行書き込みを用いたストレージ差分取得の一手法
ログ先行書き込みを用いたストレージ差分取得の一手法
Takashi Hoshino
 
Intel TSX について x86opti
Intel TSX について x86optiIntel TSX について x86opti
Intel TSX について x86opti
Takashi Hoshino
 
WalB: Block-level WAL. Concept.
WalB: Block-level WAL. Concept.WalB: Block-level WAL. Concept.
WalB: Block-level WAL. Concept.
Takashi Hoshino
 
VMware Backup in Cybozu Labs
VMware Backup in Cybozu LabsVMware Backup in Cybozu Labs
VMware Backup in Cybozu Labs
Takashi Hoshino
 
Vmbkp: VMware vSphere Incremental Backup Tool
Vmbkp: VMware vSphere Incremental Backup ToolVmbkp: VMware vSphere Incremental Backup Tool
Vmbkp: VMware vSphere Incremental Backup Tool
Takashi Hoshino
 
Inside database
Inside databaseInside database
Inside database
Takashi Hoshino
 

More from Takashi Hoshino (19)

Serializabilityとは何か
Serializabilityとは何かSerializabilityとは何か
Serializabilityとは何か
 
Isolation Level について
Isolation Level についてIsolation Level について
Isolation Level について
 
データベースシステムにおける直列化可能性と等価な時刻割り当てルールの提案 rev.3
データベースシステムにおける直列化可能性と等価な時刻割り当てルールの提案 rev.3データベースシステムにおける直列化可能性と等価な時刻割り当てルールの提案 rev.3
データベースシステムにおける直列化可能性と等価な時刻割り当てルールの提案 rev.3
 
WalB Driver Internals
WalB Driver InternalsWalB Driver Internals
WalB Driver Internals
 
Effective Modern C++ 勉強会#1 Item3,4
Effective Modern C++ 勉強会#1 Item3,4Effective Modern C++ 勉強会#1 Item3,4
Effective Modern C++ 勉強会#1 Item3,4
 
WALをバックアップとレプリケーションに使う方法
WALをバックアップとレプリケーションに使う方法WALをバックアップとレプリケーションに使う方法
WALをバックアップとレプリケーションに使う方法
 
WalBの紹介
WalBの紹介WalBの紹介
WalBの紹介
 
10分で分かるバックアップとレプリケーション
10分で分かるバックアップとレプリケーション10分で分かるバックアップとレプリケーション
10分で分かるバックアップとレプリケーション
 
10分で分かるLinuxブロックレイヤ
10分で分かるLinuxブロックレイヤ10分で分かるLinuxブロックレイヤ
10分で分かるLinuxブロックレイヤ
 
Intel TSX 触ってみた 追加実験 (TTAS)
Intel TSX 触ってみた 追加実験 (TTAS)Intel TSX 触ってみた 追加実験 (TTAS)
Intel TSX 触ってみた 追加実験 (TTAS)
 
Intel TSX HLE を触ってみた x86opti
Intel TSX HLE を触ってみた x86optiIntel TSX HLE を触ってみた x86opti
Intel TSX HLE を触ってみた x86opti
 
Suffix Array 構築方法の紹介
Suffix Array 構築方法の紹介Suffix Array 構築方法の紹介
Suffix Array 構築方法の紹介
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of Storage
 
ログ先行書き込みを用いたストレージ差分取得の一手法
ログ先行書き込みを用いたストレージ差分取得の一手法ログ先行書き込みを用いたストレージ差分取得の一手法
ログ先行書き込みを用いたストレージ差分取得の一手法
 
Intel TSX について x86opti
Intel TSX について x86optiIntel TSX について x86opti
Intel TSX について x86opti
 
WalB: Block-level WAL. Concept.
WalB: Block-level WAL. Concept.WalB: Block-level WAL. Concept.
WalB: Block-level WAL. Concept.
 
VMware Backup in Cybozu Labs
VMware Backup in Cybozu LabsVMware Backup in Cybozu Labs
VMware Backup in Cybozu Labs
 
Vmbkp: VMware vSphere Incremental Backup Tool
Vmbkp: VMware vSphere Incremental Backup ToolVmbkp: VMware vSphere Incremental Backup Tool
Vmbkp: VMware vSphere Incremental Backup Tool
 
Inside database
Inside databaseInside database
Inside database
 

Recently uploaded

Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative AnalysisOdoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Envertis Software Solutions
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
VALiNTRY360
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
XfilesPro
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
Alina Yurenko
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
TheSMSPoint
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
ToXSL Technologies
 
Requirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional SafetyRequirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional Safety
Ayan Halder
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
dakas1
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
mz5nrf0n
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
Patrick Weigel
 

Recently uploaded (20)

Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative AnalysisOdoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
 
Requirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional SafetyRequirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional Safety
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
 

メモリより大きなデータの Sufix Array 構築方法の紹介

  • 1. メモリより⼤大きいデータの Suffix Tree 構築⽅方法の紹介 2014-08-29 星野 喬 1
  • 2. 概要 • 主に以下の論論⽂文の紹介 • Suffix trees for inputs larger than main memory – Marina Barsky, Ulrike Stege and Alex Thomo – Information Systems archive Volume 36 Issue 3, May, 2011 – CIKM2009 での short paper の発展版 2
  • 3. 論論⽂文の主張 • メモリより⼤大きいデータの suffix tree の 構築⽅方法の提案 • 既存の構築⽅方法と違い,ディスクアクセ スはほぼシーケンシャルアクセスのみ 3
  • 4. Suffix tree とは (復復習) • ⾼高速な⽂文字列列操 作のための suffix を⽤用いた⽊木構造 – 葉葉ノード: suffix を表す – 節ノード: suffixes 間の共通 prefix を表す 4 most N −1 internal nodes in the tree. Hence, the maximum number linear in N. The tree’s total space is linear in N in the case that each stored in a constant space. Fortunately, this is the case for an implicit substrings by their positions. R a b a b c 0 1 2 3 4 0 ab abc 1-1 b 4-4 abc 1 c 4 3 c c 2 0-1 2-4 4-4 2-4 4-4 = ababc. For clarity, the explicit edge labels are shown, which are represented as in the actual suffix tree. Each suffix Si can be found by concatenating substrings
  • 5. 提案アルゴリズム: B2ST • Big tree, Big string Suffix Tree construction algorithm • 3 つのフェーズ – (1) Input partitioning – (2) Pair-wise sorting – (3) Merging 5
  • 6. 変数の意味 • X: ⼊入⼒力力⽂文字列列 (⻑⾧長さN) • Xij: X[i]からX[j-1]までの部分⽂文字列列 • Si: i 番⽬目の suffix (0 = i N) • LCPij: Si と Sj の longest common prefix • |LCPij|: LCPij の⻑⾧長さ • SA: suffix array • ST: suffix tree 6
  • 7. partition, and its positions are not included into the suffix array of the partition, positions are not included into the suffix array of the partition. This ensures suffix Input partitioning Si of a partition Xu (0 ≤ i |Xu|, 0 ≤ u k) will be sorted as representative of a suffix Sup+i of X. In practice, for real-life DNA sequences, the tail is negligibly small compared to the size of the partition itself (it never exceeded • X を k 個の Xu に分割 (0 = u k) • k = 2N/M (M: メモリサイズ) • 各 Xu は tail を持つ – Xu 内での suffix sorting 時の区別のため – tail と同じ部分⽂文字列列が含まれないように – 遺伝⼦子だと⾼高々 1000 くらい characters in our experiments with DNA databases). Consider the example in Figure 2. It shows the partitioning of input string ababaaabbabbbabaabab into four partitions. The main memory can hold up to 16 the input at a time. To facilitate the example, we represent our input as stands for 0-bit, b stands for 1-bit). In this illustration we also refer to the as A, B, C, and D in order to distinguish them from the numbers representing positions. Note that the tail of partition XB is substring bbb, which never = aabba. Partition A Partition B Partition C Partition D a b a b a a a b b a b b b a b a a b a b 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Input string X = ababaaabbabbbabaabab and its four partitions. The tail of partition B 7is
  • 8. Pair-wise sorting • 全ての XuXv について SA を LCP 付きで構 築 (SAuv) – XuXv: Xu と Xv を結合したもの (u v) – オンメモリで処理理できるように k を調整済 – SA 内で隣隣り合う suffix 同⼠士の LCP も保存 • 出⼒力力 – SAu: Xu の SA – Ruv: SAuv 内の LCP と uv を表す bit の列列 8
  • 9. In this step we generate suffix arrays for each pair of partitions. The pseudocode for this step is shown in Figure 3. We concatenate every possible pair u, v of partitions with their tails (0 ≤ u k − 1, u + SAu 1 ≤ v k, main memory and build the suffix array と u v) into Ruv string XuXv. のWe 例例load this input into the SAuv with attached LCP length information for each suffix. The suffixes which start in tail positions are excluded from the output suffix array, they serve only for determining the relative order of suffixes starting at the end of each partition. An LCP entry of SAuv is the length of the longest common prefix of each suffix in SAuv with its immediate predecessor. Figure • SAu 4 shows size: such an 4N/array k for + the α pair A, B of partitions for the same input string as in Figure 2. From each SAuv, we extract – N/k 2^32 が前提 • Ruv size: 8N/k + α • SAu は k 個 • Ruv は k(k-1)/2 個 • 計 4kN + α: O(kN) 9 for determining the relative order of suffixes starting at the end of LCP entry of SAuv is the length of the longest common prefix of each immediate predecessor. Figure 4 shows such an array for the pair the same input string as in Figure 2. From each SAuv, we extract SAAB (in memory) suffix start LCP partition bit 5 1 3 1 2 5 4 2 4 3 0 2 1 3 2 3 0 2 3 1 A B A A B B A A B B RAB SAAB (in memory) suffix start LCP partition bit 5 1 3 1 2 5 4 2 4 3 0 2 1 3 2 3 0 2 3 1 A B A A B B A A B B LCP 0 2 1 3 2 3 0 2 3 1 partition bit A B A A B B A A B B SAA 5 3 1 4 2 RAB LCP 0 2 1 3 2 3 0 2 3 1 partition bit A B A A B B A A B B SAA written 5 3 to 1 disk 4 2 written to disk with LCP information for a pair of partitions A and B. Two structures are the suffix array of partition A and (2) the order array RAB storing the relative B. These two structures are written to disk. Fig. 4. Suffix array SAAB with LCP information for a pair of partitions A and B. Two structures are extracted from SAAB: (1) the suffix array of partition A and (2) the order array RAB storing the relative order of suffixes in A and B. These two structures are written to disk.
  • 10. Pair-wise sorting pseudo code 10 Algorithm pairwiseSorting input: k partitions of string X 1. for Algorithm (u=0; upairwiseSorting k-1; i++) 2. for input: (v=1; k partitions vk; v++) of string X 3. 1. concatenate for (u=0; uk-1; 2. for (v=1; vk; XuXv i++) v++) and load into RAM 4. 3. build concatenate suffix array XuXv with and load LCP into SAuv RAM 5. 4. during build sequential suffix array scan with LCP of SAuv SAuv 6. 5. if during v==k-sequential 1 //last scan chunk of SAuv 6. output if v==k-1 to //last chunk 7. 7. output to disk disk SAv SAv 8. 8. output output to to disk disk SAu SAu 9. 9. output output to to disk disk Ruv Ruv //order //order array array Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs. Fig. 3. Algorithm for pairwise sorting of suffixes in all partition pairs. Step 2: Suffix sorting in partition pairs
  • 11. Merging • 2 つの部品で構成 – SAu と Ruv のストリームから SA のストリー ムを作る – SA のストリームから ST を作る • オンメモリデータ – SA_buf_u: SAu ⼊入⼒力力バッファ(size m, k 個) – R_buf_uv: Ruv ⼊入⼒力力バッファ(size 2m/k, k(k-1)/2 個) – ST_buf: ST 出⼒力力バッファ (多分 size M-km) 11
  • 12. SA ストリーム作成 12 SAu ... Ruv ... HeapSA with LCP Heap 構造を使って SAu (0 = u k) の中で⼀一番⼩小さい suffix を持つ u を 選んで出⼒力力
  • 13. 5. Merge algorithm 13 What happens with each suffix in the output buffer is the subject pseudocode of merge is shown in Figure 8. Algorithm merge 1. lastTransferred = null 2. while heap is not empty 3. remove smallest suffix Si of partition u from the top of the heap 4. rebalance heap 5. lcp = 0 6. if lastTransferred is not null 7. v = lastTransferred.partitionID 8. lcp = LCP from R_bufuv [current_pointer] 9. create leaf for Si using lcp in ST_buf 10. advancePointers (u) 11. lastTransferred=Si 12. if ST_buf is full 13. store Si (max suffix) as a pointer to the current tree 14. write ST_buf to disk 4. rebalance heap 5. lcp = 0 6. if lastTransferred is not null 7. v = lastTransferred.partitionID 8. lcp = LCP from R_bufuv [current_pointer] 9. create leaf for Si using lcp in ST_buf 10. advancePointers (u) 11. lastTransferred=Si 12. if ST_buf is full 13. store Si (max suffix) as a pointer to the current tree 14. write ST_buf to disk 15. lastTransferred = null 16. Sj = get next suffix from SA_bufu 17. if Sj is not null 18. insert Sj into heap Fig. 8. The general pseudocode for merge. lcp = 0 6. if lastTransferred is not null 7. v = lastTransferred.partitionID 8. lcp = LCP from R_bufuv [current_pointer] 9. create leaf for Si using lcp in ST_buf 10. advancePointers (u) 11. lastTransferred=Si 12. if ST_buf is full 13. store Si (max suffix) as a pointer to the current tree 14. write ST_buf to disk 15. lastTransferred = null 16. Sj = get next suffix from SA_bufu 17. if Sj is not null 18. insert Sj into heap Fig. 8. The general pseudocode for merge.
  • 14. the information we need to efficiently perform the merge. As produce information. Since each Ruv contains an information only about two partitions, we need to the use suffix one bit to tree represent for the the partition entire ID input in Ruv. string Specifically, X. we We use are 0 for doing u entire v (input u v). Figure string InitialMerge 4 shows into SAA main and memory. RAB extracted algorithmIn from fact, SAAB we for never At the end of this step we have on disk suffix arrays for partitions partition access pair (k k (of total size plus k(k − 1)/2 order arrays for each possible pair of partitions (of total size kN). Algorithm initializeMerge This is all the information we need to efficiently perform the merge. As a result of merge we produce the suffix tree for the entire input string X. We are doing this without loading the entire input string into main memory. In fact, we never access X anymore. 14 1. for each SA_bufu 2. read first m start positions from disk suffix array SAu Algorithm initializeMerge 1. for each SA_bufu 2. read first m start positions 3. for each R_bufuv 4. read first m/k LCP+partitionBit from Ruv from disk suffix array SAu 3. for each R_bufuv 4. read first m/k LCP+partitionBit from Ruv 5. for each SA_bufu 6. insert SA_bufu[0] into heap 5. for each SA_bufu 6. insert SA_bufu[0] into heap Fig. 5. The pseudocode for buffer allocation as the initial step for merge. Fig. 5. The pseudocode for buffer allocation as the initial step for merge.
  • 15. CompareSuffix algorithm 15 Algorithm compareSuffix (Si from partition u, Sj from partition v) 1. if (u = = v) 2. return -1 //Si lex Sj, since they are sorted //in increasing order inside each partition Algorithm compareSuffix (Si from partition u, 3. if (u v) 4. if (partitionBit in R_bufuv[Sj from partition v) 1. if (u = = v) current pointer] = = 0) 5. return 2. return -1 -1 //Si lex Sj, //Si since they are sorted lex Sj 6. else //in increasing order inside each partition 7. return1 //Si lex Sj 8. if (u v) 9. if (partitionBit in R_bufvu[current pointer] = = 0) 10. return 1 //Sj lex Si 11. else 12. return -1 //Sj lex Si 3. if (u v) 4. if (partitionBit in R_bufuv[current pointer] = = 0) 5. return -1 //Si lex Sj 6. else 7. return1 //Si lex Sj 8. if (u v) 9. if (partitionBit in R_bufvu[current pointer] = = 0) 10. return 1 //Sj lex Si 11. else 12. return -1 //Sj lex Si Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays created during pairwise suffix sorting. Algorithm for suffix comparison which uses the pairwise suffix information from the order during pairwise suffix sorting.
  • 16. in all relevant order buffers. We insert into the heap the we continue in a similar way until all suffixes are merged. Fig. 6. Algorithm for suffix comparison which uses the pairwise suffix information from the order arrays AdvancePointers algorithm Pseudocode of current pointer advancing in suffix array buffer and the order 16 created during pairwise suffix sorting. Algorithm advancePointers (partition ID u) 1. SA_bufu.current_pointer++ 2. if reached the end of SA_bufu 3. refill SA_bufu from disk-based SAu 4. for (i=0; iu; i++) 5. R_bufiu.current_pointer++ 6. if reached the end of R_bufiu 7. refill R_bufiu from disk-based Riu 8. for (i=u; ik; i++) 9. R_bufui.current_pointer++ 10. if reached the end of R_bufui 11. refill R_bufui from disk-based Rui pointers for all the order array buffers containing information about partition u are also advanced by 1, as shown in pseudocode of Figure 7. This means we have determined the order of the current suffix of partition u, and we need to consider the next element both in SA BUFu and in all relevant order buffers. We insert into the heap the next suffix of partition u, and we continue in a similar way until all suffixes are merged. Algorithm advancePointers (partition ID u) 1. SA_bufu.current_pointer++ 2. if reached the end of SA_bufu 3. refill SA_bufu from disk-based SAu 4. for (i=0; iu; i++) 5. R_bufiu.current_pointer++ 6. if reached the end of R_bufiu 7. refill R_bufiu from disk-based Riu 8. for (i=u; ik; i++) 9. R_bufui.current_pointer++ 10. if reached the end of R_bufui 11. refill R_bufui from disk-based Rui Fig. 7. Pseudocode of current pointer advancing in suffix array buffer and the order buffers.
  • 17. ST 作成例例 • X: ababc • SA: 02134 • binary alphabet: a: 00, b: 01, c: 10 • 本⼿手法では ST は binary alphabet で表現 – 必ず 2 分⽊木になるので扱いやすい 17
  • 18. ST with binary alphabet 18 ST with alphabet ST with binary alphabet S4 S4 S0S2S1S3 S0S2S1S3 abbc abccabcc 010 0011 0001101000011010 index や LCP が bit 単位か byte (⽂文字) 単位かの違い
  • 19. ⽊木のどこに挿⼊入するか • SA の順で挿⼊入するので,必ず右端のパス 内のどこかに internal node を挿⼊入 • LCP を⾒見見てどこに挿⼊入するか決定 19 L1 L1 L1 inserted internal node L1 L1inserted leaf node L1
  • 20. Insert s0 then s2 20 X: ababc Xb: 0001000110 SA: 02134 Insert s0 (0001000110, I0, L0)Insert s2 (000110, I4, L4) S0 S0 I0 L4 S2 I0+4 I4+4 I0 IX: index is X (in unit of bit) LX: LCP is X (in unit of bit)
  • 21. Insert s1 21 X: ababc Xb: 0001000110 SA: 02134 Insert s2 (000110, I4, L4)Insert s1 (01000110, I2, L1) S0 I0 L4 S2 I4 I8 Ix: index is x (in unit of bit) Lx: LCP is x (in unit of bit) S0 I0+1 L4-1 S2 I4 I8 I0 L1 S1 I2+1
  • 22. Insert s3 22 X: ababc Xb: 0001000110 SA: 02134 Insert s1 (01000110, I2, L1)Insert s3 (0110, I6, L2) Ix: index is x (in unit of bit) Lx: LCP is x (in unit of bit) S0 I1 L3 S2 I4 I8 I0 L1 S1 I3+1 S0 I1 L3 S2 I4 I8 I0 L1 S1 I3 I3 L2-1 S3 I6+2
  • 23. Insert s4 23 X: ababc Xb: 0001000110 SA: 02134 Insert s3 (0110, I6, L2)Insert s4 (10, I8, L0) Ix: index is x (in unit of bit) Lx: LCP is x (in unit of bit) S0 I1 L3 S2 I4 I8 I0 L1 S1 I4 I3 L1 S3 I8 S0 I1 L3 S2 I4 I8 I0+0 L1-0 S1 I4 I3 L1 S3 I8 I0 L0 S4 I8+0
  • 24. Index と LCP の意味 24 S0 I1 L3 S2 0 10 I0 L1 I0+1+3 I4+1+3 I3 L1 S1 I2+1+1 S3 I6+1+1 I0 L0 S4 I8 001 1 000110 10 000110 10 Xb: 0001000110 idx: 0123456789
  • 25. ST 出⼒力力フォーマット • 3 つのデータ – Internal node ⽤用の array – Leaf node ⽤用の array – 最⼤大 suffix の prefix (実体): divider • divider を使って検索索すべきファイルを特定 25 Internal node structure (順番は不不明)Leaf node structure Left childRight childIndex 4bytes4bytes4bytes partition id (1byte) Index 4bytes partition id (1byte)
  • 26. 他⼿手法との⽐比較 • TDD and ST-merge (2005年年) – 6MB メモリ 20MB データで 8 時間 • Trellis+SB (2008年年) – 3GB データ 512MB メモリ で 11 時間 – ランダムアクセス 95% 減 • DiGeST (2008年年) – ランダムアクセス 98% 減 • どれもこれも分割して処理理後 merge する • ディスクのランダムアクセスをいかに減らすか がポイント 26
  • 27. TDD which works with uncompressed inputs. The results 評価: 他⼿手法との⽐比較 Program Time, hours TDD 125 B2ST 3 • ⼊入⼒力力 3GB の DNA (1⽂文字1byte) • 使⽤用メモリ 600MB 27 Trellis+SB 11 of different suffix tree construction algorithms for approximately which is larger than the total allocated main memory. suffix tree for the above 3GB input in 125 hours. of Trellis+SB reported in [20]. The value in Figure similar settings on a comparable machine. divided the 3GB into partitions of 1GB each and built
  • 28. produced an intermediate on-disk output of size 234GB. Despite this, the merge in only 59 minutes, scanning all this on-disk data in sequential manner 2514 suffix tree files of total 215GB. 評価: ⼊入⼒力力サイズを変えて⽐比較 Total time, min Merge, min Pairwise sorting, min Number of partition pairs Number of partitions Input size, GB 6 3 3 441 27 468 8 4 6 730 34 764 10 5 10 1100 42 1142 • ⼊入⼒力力データは DNA を複数結合したもの • メモリは 2GB 使⽤用 • Pairwise sorting は k^2 で効いてくるが並列列化で きそう • Merge はほぼリニア 28 12 6 15 1462 59 1521 Running time (min) of B2ST for different sets (approximately 6, 8, 10 and 12GB) of genomic 2GB of main memory. example shows that we need a large temporary disk space for scaling up the Specifically, we need D = k2p = kN disk space to store the order arrays partition pairs. Since the number of partitions is k = N/M, from D = N2/M the size of the largest input that we can process with M bytes of internal bytes of disk space. If we substitute the common values for modern computers,
  • 29. 所感 • もしかしてメモリのランダムアクセスよりも ディスクのシーケンシャルアクセスの⽅方がス ループット⼤大きかったりする??? • k を定数と⾒見見做せるケースでしか使えない – 実質的に partition id に 1 byte, suffix index に 4 byte の計 5 bytes (N 2^40) 程度度のデータが対象 – Induce sort でやっているように再帰的に⽐比較して O(kN) à O(N) にできないものか • tail の件といい dividers の件といい,LCP が巨⼤大 な場合をあまり考慮していない – 遺伝⼦子が対象ならそれで良良いが... 29