2009 CSBB LAB New Student Orientation
Introduction to Computer Science. Presented by 黃智沂.

Slide notes
  • We usually use the matrix to hold the cost of the edges in the bipartite graph. First we need a matrix of the costs of the workers doing the jobs.
  • Let us now talk about a more sophisticated data structure: Range Trees. The 1-D case is straightforward; even a sorted list of the points would suffice. But sorting wouldn't generalize to higher dimensions, so we use binary trees instead. Build a perfectly balanced binary tree on the sorted list of points. Input points are stored in leaves (all leaves are linked in a list); each internal node stores the highest value in its left subtree. Comparing the query boundary with this value helps us reach the first point falling in the query. Consider the following example. Query time is O(log n + k) for the reporting case, and O(log n) for counting.

Transcript

  • 1. CSBB LAB New Student Orientation: Fundamentals of Computer Science. Speaker: 黃智沂, 2nd-year Ph.D. student
  • 2. Acknowledgements
    • Thanks to the many slide authors on the Internet; these slides were humbly compiled from their work.
  • 3. Why talk about basic computer science?
    • Because not everyone comes from a CS background
    • Because even CS people can have fuzzy concepts
    • Because these topics come up often in our discussions
    • Because it may be useful
    • Because this is academia
    • Because you will be attending
      • the LEAST Seminar.
  • 4. Quotations from grand-advisor 李家同
    • Fundamentals matter most; don't always go chasing the hardest things.
    • For graduate students, the default is staying at school doing research 24 hours a day.
  • 5. Outline
    • Basic concepts of algorithm analysis
    • Basic concepts of computation theory
    • Common classes of algorithms
    • Biology-related problems and applications:
      • Sequence alignment
      • Biological networks and graph theory
      • Artificial intelligence and machine learning
      • Advanced data structures
  • 6. What is an algorithm?
    • Characteristics of an algorithm
      • Input: an algorithm must have zero or more inputs.
      • Output: an algorithm should have one or more outputs, which are the results of its computation.
      • Definiteness: the description of an algorithm must be unambiguous, so that its actual execution precisely matches the requirements or expectations; usually the result of execution is required to be deterministic.
      • Finiteness: an algorithm is a sequence of operations that can be simulated by any system, and it must complete its task within a finite number of steps.
      • Effectiveness: also called feasibility. The algorithm can be implemented; every operation it describes can be realized by finitely many executions of already-implemented basic operations.
  • 7. Introduction
    • The goal of algorithm analysis: estimating an algorithm's performance
    • Algorithms and systems vary wildly; we usually cannot predict an algorithm's behavior precisely
    • To make analysis tractable, we define a few key parameters and evaluation criteria
    • We settle for approximate analysis, not perfect analysis
  • 8. Big-O notation
    • When evaluating an algorithm's speed, constants are usually ignored; Big-O notation expresses this
    • For example:
    • 5n^2 + 15 = O(n^2)
    • Big-O denotes an upper bound, so
    • 5n^2 + 15 = O(n^3) is also correct
  • 9. Big-O notation
    • Big-O conveniently discards constants
    • O(5n + 4) = O(n)
    • O(log n) needs no base
    • A constant upper bound is written O(1)
  • 10. Big-O notation
    • Big-O can be used inside equations to express lower-order terms quantitatively
    • For example:
    • T(n) = 3n^2 + O(n)
    • S(n) = 2n log2 n + 5n + O(1)
  • 11. Theorem
    • Theorem
    • Exponential functions grow faster than polynomial functions
    • Polynomial functions grow faster than logarithmic functions
    • For example
    • n = O(2^n)
    • n^2 = O(2^n)
    • n^3 = O(2^n)
    • n^99 = O(2^n)
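To see the theorem in action, a small Python check (a sketch not in the original slides; the sample values of n are arbitrary), showing that 2^n eventually overtakes even n^99:

    # Exponential growth overtakes any polynomial; the crossover is large but finite.
    for n in [10, 100, 1000, 2000]:
        poly, expo = n**99, 2**n
        print(n, "n^99 > 2^n" if poly > expo else "n^99 <= 2^n")
    # n = 10 and 100: the polynomial is still ahead; n = 1000 and 2000: 2^n wins.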
  • 12. Lemma
    • Addition and multiplication of Big-O expressions are valid
    • For example
    • n^3 + n^2 = O(n^3 + n^2) = O(n^3)
    • n^3 * n^2 = O(n^3 * n^2) = O(n^5)
    • But subtraction and division are not
  • 13.
    • Running times of different computers & algorithms for n = 1000

        f(n)       1,000 steps/s   2,000 steps/s   4,000 steps/s   8,000 steps/s
        log2 n     0.010 s         0.005 s         0.003 s         0.001 s
        n          1 s             0.5 s           0.25 s          0.125 s
        n log2 n   10 s            5 s             2.5 s           1.25 s
        n^1.5      32 s            16 s            8 s             4 s
        n^2        1,000 s         500 s           250 s           125 s
        n^3        1,000,000 s     500,000 s       250,000 s       125,000 s
        1.1^n      10^39 s         10^39 s         10^38 s         10^38 s
  • 14. Big-O notation
    • Big-O notation states an upper bound on an algorithm
    • Every algorithm in the textbook has running time bounded above by O(2^n)
    • That is, none of them needs more than exponential time
    • But O(2^n) is a very loose estimate; in practice these algorithms can usually be made much faster than O(2^n)
  • 15. Big-O notation
    • We are interested not just in an upper bound, but in an expression as close to the actual running time as possible
    • If such an expression is hard to obtain, at least estimate the upper and lower bounds
    • Finding lower bounds is much harder than finding upper bounds
  • 16. Upper & lower bounds (figure, from slow to fast): a particular algorithm shows that the problem can be solved at least this fast, i.e., an upper bound; no algorithm at all can be faster than the lower bound; between them lies the shortest possible running time.
  • 17. Ω notation
    • Ω notation: a lower bound for an algorithm
    • Like O, Ω notation may ignore constants
    • For example:
    • n^2 - 100 = Ω(n^2)
    • Since it expresses a lower bound,
    • n^2 = Ω(n)
    • Ω corresponds to the relation "greater than or equal to"
  • 18. Θ notation
    • When the upper and lower bounds coincide, we have pinned down the actual running speed exactly
    • If:
    • f(n) = O(n), an upper bound (less than or equal to), and
    • f(n) = Ω(n), a lower bound (greater than or equal to),
    • then we use Θ notation:
    • f(n) = Θ(n), i.e., equal
  • 19. Time and space complexity
    • How can we know an algorithm's running time without executing it?
    • Method: count the number of instructions the algorithm executes
    • But an algorithm may contain several different kinds of instructions
    • and each kind of instruction takes a different amount of time
    • For example: division is slower than addition
  • 20. Space complexity
    • Space complexity refers to the storage an algorithm needs during execution
    • Like time complexity, space complexity considers the worst case
    • A space complexity of O(n) means each input element is allotted a fixed amount of storage. A space complexity of O(1) means the algorithm needs a fixed amount of storage, independent of the input size
  • 21. Space complexity

        input size n   10   100   1,000   10,000   100,000
        O(n) space     10   100   1K      10K      100K
        O(1) space     c    c     c       c        c
  • 22. Trading one complexity for the other
    • The trade-off between space complexity and time complexity
      • Does an algorithm that runs in O(n) TIME necessarily need O(n) SPACE?
      • Does an algorithm that uses O(n) SPACE necessarily need O(n) TIME?
  • 23. Advanced complexity analysis
    • Amortized Complexity
    • Average-Case Complexity
    • Combinatorial Complexity
    • Knowledge Complexity
    • Free-Bits Complexity
    • etc.
  • 24. References
    • Introduction to Algorithms, by T. Cormen, C. Leiserson & R. L. Rivest
  • 25. Outline
    • Basic concepts of algorithm analysis
    • Basic concepts of computation theory
    • Common classes of algorithms
    • Biology-related problems and applications:
      • Sequence alignment
      • Biological networks and graph theory
      • Artificial intelligence and machine learning
      • Advanced data structures
  • 26. Analysis depends on the computation model
    • How do you analyze?
      • Which model do you adopt?
        • Circuit?
        • Turing Machine?
        • Counter Machine?
        • Pointer Machine?
        • Lambda Calculus?
  • 27. What is a Turing Machine?
    • Control is similar to (but not the same as) DFA
    • It has an infinite tape as memory
    • A tape head can read and write symbols and move around the tape
    • Initially, the tape contains the input string (at the leftmost end) and is blank everywhere else
    (figure: a control unit attached to a tape holding "b a b a" followed by blank symbols ⊔)
  • 28. What is a TM? (2)
    • Finite number of states: one for immediate accept , one for immediate reject , and others for continue
    • Based on the current state and the tape symbol under the tape head, TM then decides the tape symbol to write on the tape, goes to the next state, and moves the tape head left or right
    • When TM enters accept state, it accepts the input immediately ; when TM enters reject state, it rejects the input immediately
    • If it never enters the accept or reject state, the TM will run forever , and never halt
  • 29. Extensions of Turing Machine
    • However, none of these facilities increases the power of the TM.
    • Church-Turing Thesis:
    • TMs are the ultimate computational devices.
    • (TM = algorithms)
  • 30. Variants of TM
    • Multiple tapes (2-tape machines).
    • Multiple heads.
    • Two-way tape.
    • Random access memory.
    • Two-dimensional memory.
    • Oracle Turing Machine
      • Randomized Turing Machine
  • 31. Multi-tape Turing Machines: Informal Description (figure: a control unit attached to a finite number of tapes; each tape i holds ▷ a1 a2 … and has its own head)
  • 32. Multi-tape Turing Machines: Informal Description (II)
    • Each tape is bounded to the left by a cell containing the symbol ▷
    • Each tape has its own head
    • Transitions have the form (for a 2-tape Turing machine):
    ((p, (x1, x2)), (q, (y1, y2))) such that each xi is in Γ and each yi is in Γ or is L or R, and if xi = ▷ then yi = ▷ or yi = R
  • 33. Multi-tape Turing Machines vs Turing Machines
    • Is a multi-tape TM more powerful than a single-tape TM?
    • Consider the problem:
      • Does string A equal string B?
  • 34. a b  a b  Tape 1 a b a Tape 2 State in M2: s Solve by 2-tape Turing Machine M2 : a b  a b  Tape 1 a b a Tape 2 State in M2: s’
  • 35. Using States to “Remember” Information Equivalent configuration in a Turing Machine M : a b  a b a b # a b
  • 36. Theorem
    • The expressive power of the TM equals the expressive power of the multi-tape TM
    • Q: Is a multi-tape TM faster than a one-tape TM?
  • 37. Oracle Turing Machine
    • An oracle is a black box. You can consider it a special device (machine).
    • An oracle for X is a black box that can answer any instance of the problem X in O(1) time.
    • An oracle machine is a Turing machine connected to an oracle. Thus an oracle Turing machine is also a multi-tape TM.
  • 38. Oracle Turing Machine
    • We can extend it in many ways, e.g., to devise a TM which can run Randomized Quick-Sort.
      • Use Oracle to flip the coin.
      • Or use Oracle to generate random bit sequence.
  • 39. Definition: A Non-Deterministic TM is a 7-tuple T = (Q, Σ, Γ, δ, q0, q_accept, q_reject), where: Q is a finite set of states; Γ is the tape alphabet, where ⊔ ∈ Γ and Σ ⊆ Γ; q0 ∈ Q is the start state; Σ is the input alphabet, where ⊔ ∉ Σ; δ : Q × Γ -> Pow(Q × Γ × {L, R}) is the transition function; q_accept ∈ Q is the accept state; q_reject ∈ Q is the reject state, and q_reject ≠ q_accept
  • 40. Acceptance for NTM
    • If w is in L:
    • There is some computation leading the machine into an accepting configuration.
    • If w is NOT in L:
    • The machine always rejects the string.
  • 41. Non-Deterministic TM is a Parallel Universe
  • 42. Definition: NTIME(t(n)) is the set of languages decided by an O(t(n))-time non-deterministic Turing machine. TIME(t(n)) ⊆ NTIME(t(n))
  • 43. NTM vs. DTM
    • Theorem:
    • Non-deterministic Turing machines can be converted into deterministic Turing Machines.
    • NTM = DTM
  • 44. Deterministic Polynomial Time: P = ⋃_{k ∈ N} TIME(n^k)
  • 45. Non-deterministic Polynomial Time: NP = ⋃_{k ∈ N} NTIME(n^k)
  • 46. Is the NTM too abstract?
    • Another model for NP:
      • Karp's proof system.
  • 47. Theorem: L  NP if and only if there exists a poly-time Turing machine V with L = { x |  y. |y| = poly(|x|) and V(x,y) accepts } . Proof:
    • If L = { x |  y. |y| = poly(|x|) and V(x,y) accepts }
    • then L  NP.
    Because we can guess y and then run V. (2) If L  NP then L = { x |  y. |y| = poly(|x|) and V(x,y) accepts } Let N be a non-deterministic poly-time TM that decides L. Define V(x,y) to accept if y is an accepting computation history of N on x.
  • 48. A language is in NP if and only if there exist polynomial-length certificates for membership to the language. SAT is in NP because a satisfying assignment is a polynomial-length certificate that a formula is satisfiable.
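To make the certificate idea concrete, a minimal sketch of such a verifier V(x, y) for SAT (not from the slides; the clause encoding is our own assumption): x is a CNF formula and y a candidate assignment, and checking takes time polynomial in the formula size.

    def verify_sat(cnf, assignment):
        """V(x, y): accept iff the assignment y satisfies the CNF formula x.
        cnf is a list of clauses; each clause is a list of signed ints,
        e.g. [[1, -2], [2, 3]] means (x1 or not x2) and (x2 or x3)."""
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in cnf
        )

    # (x1 or not x2) and (x2 or x3), certificate x1=True, x2=True, x3=False
    print(verify_sat([[1, -2], [2, 3]], {1: True, 2: True, 3: False}))  # True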
  • 49. NP-Complete
    • If L is NP-Hard, then every problem L' in NP can be reduced to L.
    • If L is NP-Complete, then L is NP-Hard and L is itself an NP problem.
    • Saying that a problem L' can be reduced to L means that L is at least as hard as L'.
    • We know many NP-C problems, e.g., SAT, TSP, Bin-Packing, Knapsack, etc.
  • 50. Karp’s Complete Problems
  • 51. The World by Karp (figure): in P: 2-SAT, Shortest-Path, Minimum-Cut, Arc-Cover; NP-Complete: SAT, Clique, Hamiltonian-Circuit, Chromatic Number, …; NP-Hard: Equivalence of Regular Expressions, Equivalence of ND Finite Automata, Context-Sensitive Recognition; open at the time (in P? in NPC?): Linear-Inequalities, Graph-Isomorphism, Non-Primes
  • 52. NP vs. P: what does this mean?
  • 53. How to use NPC? NP = The set of all the problems for which you can verify an alleged solution in polynomial time.
  • 54. The best outcome, of course, is to prove that no good method exists. For example, the well-known lower bound for the (comparison-based) Sorting Problem is Ω(n lg n).
  • 55. But this is usually even harder than finding an algorithm.
  • 58. Reference
    • Computers and Intractability:
    • A Guide to the Theory of NP-Completeness
    • by Michael Garey and David Johnson
  • 59. Outline
    • Basic concepts of algorithm analysis
    • Basic concepts of computation theory
    • Common classes of algorithms
    • Biology-related problems and applications:
      • Sequence alignment
      • Biological networks and graph theory
      • Artificial intelligence and machine learning
      • Advanced data structures
  • 60. Levels of problem solving (1)
    • Heuristics
      • When we can neither find an optimal solution nor prove the problem NP-C, we propose a method that solves it in practice but whose distance from the optimum we cannot bound. Such methods usually need experiments to validate their feasibility and quality.
    • Approximation Algorithms
      • When the problem is proven NP-C, an efficient exact algorithm is unlikely. If we can prove that the method's output is within a specific factor of the optimum, we call it an approximation algorithm; e.g., a 2-approximation algorithm stays within a factor of 2 of the optimum.
  • 61. Levels of problem solving (2)
    • On-Line Algorithms
      • Some problems receive their input dynamically, so it is impossible to see the whole input before computing. These problems may be hard or even NP-C; algorithms that solve them are called on-line algorithms. Like approximation algorithms, on-line algorithms need a measure to separate good from bad. Competitive analysis formalizes this idea by comparing the relative performance of an online and an offline algorithm on the same problem instance.
  • 62. Levels of problem solving (3)
    • Randomized algorithms
      • Any algorithm that uses random bits during its computation is a randomized algorithm.
      • Two common classes:
        • Monte Carlo: the computed result is correct with probability greater than one half.
        • Las Vegas: any result it produces is correct, but it may not produce a result every time.
  • 63. Levels of problem solving (4)
    • External Memory Algorithm
      • An algorithm that is efficient when accessing most of the data is very slow, e.g., when the data resides on disk.
  • 64. Levels of problem solving (5)
    • Parallel Algorithms
      • Algorithms that compute with a large number of CPUs simultaneously. Different parallel architectures require different algorithm designs; the simplest form is multi-threading.
  • 65. References
    • Internet
    • Prof. 韓永楷, Randomized Algorithms
    • Prof. 王炳豐, Parallel Algorithms
    • Prof. 林俊淵, Parallel Computing (CUDA)
    • Prof. 鐘葉青, Parallel Programming
    • Our formidable senior labmate, 劉至善
  • 66. Outline
    • Basic concepts of algorithm analysis
    • Basic concepts of computation theory
    • Common classes of algorithms
    • Biology-related problems and applications:
      • Sequence alignment
      • Biological networks and graph theory
      • Advanced data structures
  • 67. Biology-related problems and applications
    • Comparative genomics
    • Systems biology
    • Translational medicine
  • 68. Outline
    • Basic concepts of algorithm analysis
    • Basic concepts of computation theory
    • Common classes of algorithms
    • Biology-related problems and applications:
      • Sequence alignment
      • Biological networks and graph theory
      • Advanced data structures
  • 69. Sequence alignment
    • Global and local alignments
    • Multiple sequence alignment
    • Basic Local Alignment Search Tool (BLAST)
  • 70. Global Alignment vs. Local Alignment
    • global alignment :
    • local alignment :
  • 71. Comparing two sequences
    • In the 1970s, molecular biologists Needleman and Wunsch [15] used dynamic programming to analyze the similarity of amino-acid sequences;
    • Interestingly, in the same period, computer scientists Wagner and Fischer [22] computed the edit distance between two sequences in a very similar way; these two landmark works were completed independently, each unaware of the other.
    • Although biologists look at how similar two sequences are while computer scientists look at how different they are, the two problems have been shown to be dual problems whose values can be converted into each other by a formula.
  • 72. Homology Search Tools
    • Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
    • FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
    • BLAST (Altschul et al., 1990; Altschul et al., 1997)
    • BLAT (Kent, 2002)
    • PatternHunter (Li et al., 2004)
  • 73. Three popular approaches to sequence analysis
    • Today's sequence-analysis tools come in great variety; nevertheless, three ideas have won the most favor:
    • The first is the Smith-Waterman method, which carefully computes the best k local alignments of two sequences. Although precise, it is time-consuming, so it is mostly applied to comparisons of shorter sequences; some researchers have also worked on reducing its computational complexity, giving it practical uses on long sequences as well.
  • 74. Three popular approaches to sequence analysis (cont.)
    • The second is Pearson's FASTA, which first finds some interesting regions quickly and then applies the Smith-Waterman method within those regions. It is therefore faster than Smith-Waterman, and in many cases its sensitivity is not much worse.
    • The third is BLAST, built by Altschul et al. Its first version did not consider gaps at all, making it much faster than the other approaches. Although less sensitive, its speed gives it a great advantage for searching biological sequence databases, which has made it arguably the most popular sequence-analysis tool. Moreover, Gapped BLAST, released in 1997, greatly improved sensitivity while keeping much of the speed advantage.
  • 75. Why use an alignment?
    • Early sequence analysis was usually done with the dot matrix method, which marks, on a 2-D grid, the positions where the two sequences agree, so similar regions can be seen by eye. Its great advantages are clarity at a glance and simple computation;
    • However, when the sequences are long, inspecting them visually is a very inefficient form of analysis,
    • and for some biological sequences (such as proteins) similarity is not limited to identical characters, in which case the dot matrix cannot reveal the overall degree of similarity.
    • Hence the suggestion to display the similarity of two sequences with an alignment.
  • 76. Alignment
    • Given two sequences, a global alignment inserts dashes into them so that the two sequences become equally long and no position holds dashes in both sequences.
    • For example, for the sequences CTTGACTAGA and CTACTGTGA, one possible alignment is:
    •       CTTGACT-AGA
    •       CT--ACTGTGA
    • Figure: one possible alignment of CTTGACTAGA and CTACTGTGA.
  • 77. Scoring an alignment
    • With so many possible alignments, which one should we pick? To select a good alignment, we usually need a scoring scheme for screening.
    • The simplest scheme gives each aligned pair a score and picks the alignment with the highest total. Let w(a,b) be the score of aligning a with b (usually w(*,-) and w(-,*) are negative; mismatches are also negative; only matches are positive; protein sequence analysis uses PAM or BLOSUM matrices to set these values)
    • Under this simple scheme, the alignment in the previous figure scores w(C,C) + w(T,T) + w(T,-) + … + w(A,A)
  • 78. The optimal alignment algorithm
  • 79. Affine gap penalties
    • We can compute S(i, j) bottom-up with dynamic programming, recording where each optimum came from; after the computation, we can then trace back the optimal alignment in one pass.
    • When comparing biological sequences, we usually charge an extra penalty (call it α) for every dash region, i.e., every "gap". A gap in the first sequence is called an insertion gap; a gap in the second sequence is called a deletion gap.
    • For example, the alignment in the earlier figure has one deletion gap of length 2 and one insertion gap of length 1, so its score is further reduced by the two gap penalties (2α). This scoring scheme is called affine gap penalties.
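A minimal global-alignment (Needleman-Wunsch) sketch in Python, using a simple linear gap penalty rather than the affine scheme above; the score values are illustrative assumptions, not from the slides:

    def global_align_score(s, t, match=1, mismatch=-1, gap=-2):
        """Needleman-Wunsch: score of the best global alignment of s and t."""
        m, n = len(s), len(t)
        # S[i][j] = best score aligning s[:i] with t[:j]
        S = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            S[i][0] = i * gap
        for j in range(1, n + 1):
            S[0][j] = j * gap
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                w = match if s[i-1] == t[j-1] else mismatch
                S[i][j] = max(S[i-1][j-1] + w,   # align s[i] with t[j]
                              S[i-1][j] + gap,   # gap in t (deletion)
                              S[i][j-1] + gap)   # gap in s (insertion)
        return S[m][n]

    print(global_align_score("CTTGACTAGA", "CTACTGTGA"))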
  • 80. The optimal alignment algorithm
  • 81. The optimal alignment algorithm (cont.)
  • 82. Local alignment
    • In biological sequence alignment, it is sometimes more interesting to find the similarity of local regions. We then consider the so-called local alignment problem: instead of aligning the sequences end to end, we only need the best alignment between some segment of sequence one and some segment of sequence two.
    • Here we use the simplest scoring scheme (the alignment score is the sum of the aligned-pair scores) to explain how to compute an optimal local alignment.
  • 83. The optimal local alignment algorithm
  • 84. Why add the 0?
    • Compared with the global-alignment recurrence, the only extra term here is the 0. The reason is that a global alignment must start from the front of the sequences, while a local alignment may start anywhere: if connecting backwards would yield a score below 0, we should not connect, but instead try this point as a fresh starting point (S(i, j) = 0).
  • 85. Multiple optimal local alignments
    • Some people are interested in the k best local alignments, or in all local alignments scoring at least a set threshold; once you are familiar with dynamic programming, such computations should not stump you.
    • The method above is what is commonly called the Smith-Waterman method (historically, the global alignment problem was posed by Needleman and Wunsch [15], and the local alignment problem by Smith and Waterman [21]); it basically needs time and space proportional to the product of the two sequence lengths.
    • When the sequences are very long, both that time and that space are hard to accept!
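A matching local-alignment (Smith-Waterman) sketch, differing from the global recurrence above only by the extra 0 term and by taking the best cell anywhere in the matrix (scores again illustrative):

    def local_align_score(s, t, match=1, mismatch=-1, gap=-2):
        """Smith-Waterman: score of the best local alignment of s and t."""
        m, n = len(s), len(t)
        S = [[0] * (n + 1) for _ in range(m + 1)]
        best = 0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                w = match if s[i-1] == t[j-1] else mismatch
                S[i][j] = max(0,                 # restart here: any cell may begin an alignment
                              S[i-1][j-1] + w,
                              S[i-1][j] + gap,
                              S[i][j-1] + gap)
                best = max(best, S[i][j])
        return best

    print(local_align_score("MALAYALAM", "ALA"))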
  • 86. BLAST
    • Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)
    • The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.
  • 87. The maximal segment pair measure
    • A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from the 2 sequences (for DNA: identities +5, mismatches -4).
    (figure: the highest-scoring pair)
    • The MSP score may be computed in time proportional to the product of the sequence lengths. (How?) An exact procedure is too time-consuming.
    • BLAST heuristically attempts to calculate the MSP score.
  • 88. A matrix of similarity scores PAM 120
  • 89. A maximum-scoring segment
  • 90. BLOSUM62 versus PAM250 (For Protein)
  • 91. BLAST
    • Build the hash table for Sequence A.
    • Scan Sequence B for hits.
    • Extend hits.
  • 92. BLAST Step 1: Build the hash table for Sequence A (3-tuple example). For DNA sequences, Seq. A = AGATCGAT (positions 1-8) gives the table AGA -> 1; ATC -> 3; CGA -> 5; GAT -> 2, 6; TCG -> 4 (all other 3-tuples empty). For protein sequences, Seq. A = ELVIS: add xyz to the hash table if Score(xyz, ELV) ≥ T; add xyz to the hash table if Score(xyz, LVI) ≥ T; add xyz to the hash table if Score(xyz, VIS) ≥ T.
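A sketch of Step 1 for the DNA case (the exact-match k-tuple table; reproducing the slide's AGATCGAT example with 1-based positions):

    from collections import defaultdict

    def build_kmer_table(seq, k=3):
        """Map every k-tuple of seq to its 1-based start positions."""
        table = defaultdict(list)
        for i in range(len(seq) - k + 1):
            table[seq[i:i+k]].append(i + 1)
        return table

    print(dict(build_kmer_table("AGATCGAT")))
    # {'AGA': [1], 'GAT': [2, 6], 'ATC': [3], 'TCG': [4], 'CGA': [5]}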
  • 93. BLAST Step2: Scan sequence B for hits.
  • 94. BLAST Step 2: Scan sequence B for hits. Step 3: Extend hits. (figure: a hit being extended) Terminate if the score of the extension fades away (that is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions). BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
  • 95. Gapped BLAST (I) The two-hit method
  • 96. Gapped BLAST (II) Confining the dynamic-programming
  • 97. BLAT
  • 98. Multiple sequence alignment
    • The analysis of multiple sequences has long been an important topic in computational biology, but its complexity is discouraging. Roughly speaking, comparing two sequences of length n takes time (i.e., dynamic-programming matrix cells) proportional to n^2, while comparing k sequences of length n takes time proportional to n^k.
    • Imagine how long it would take to compare just 10 sequences of length 200 simultaneously: the time is basically proportional to 200^10, an enormous number.
  • 99. Computing multiple sequence alignments
    • Two schools of methods have thus emerged. One is the approach of Lipman et al., which compares all the sequences simultaneously but tries to reduce the number of dynamic-programming matrix cells actually computed; according to their paper, comparing 10 sequences of length 200 this way poses no great difficulty.
    • The other is the approach adopted by Feng and Doolittle, which aligns the sequences following a phylogenetic tree of their relatedness; once a gap appears in some comparison, it is kept to the end. The time this method needs to compare k sequences of length n is roughly proportional to …, so it is very widely used.
  • 100. Scoring multiple sequence alignments
    • The most widely accepted scheme is the SP (Sum-of-Pairs) score: project the multiple alignment onto every pair of sequences, and take the sum of the pairwise alignment scores as the score of the multiple alignment.
    • Applying affine gap penalties directly under this scheme spawns a great many dynamic-programming tables; relaxing it slightly to quasi-affine gap penalties is less precise but lets the score of a multiple alignment be computed more efficiently, and it is the most commonly used variant.
    • In addition, some suggest that certain sequence pairs should be weighted in the score; others compute the score according to an evolutionary tree.
  • 101. References
    • Internet
    • Prof. 盧錦隆: Computational Biology
    • Mount, David W. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press, 2001.
  • 102. Outline
    • Basic concepts of algorithm analysis
    • Basic concepts of computation theory
    • Common classes of algorithms
    • Biology-related problems and applications:
      • Sequence alignment
      • Biological networks and graph theory
      • Advanced data structures
  • 103. Biological networks and graph theory
    • Protein-protein interaction networks, gene regulation networks, etc.
    • Basic graph theory
    • Network motifs, community detection
    • Some graph algorithms
  • 104. Bio-Map (figure): the GENOME (protein-gene interactions), the PROTEOME (protein-protein interactions), and METABOLISM (bio-chemical reactions, e.g., the citrate cycle)
  • 105. Boehringer Mannheim
  • 106. An Introduction to Graph Theory. Definitions and examples (figures): undirected graph, directed graph, isolated vertex, adjacent vertices, loop, multiple edges. A simple graph is an undirected graph without loops or multiple edges. The degree of a vertex is the number of edges connected to it (indegree and outdegree in the directed case). G = (V, E)
  • 107. For a walk from x to y (figure: graph on a, b, c, d, e): path: no vertex can be repeated (a-b-c-d-e); trail: no edge can be repeated (a-b-c-d-e-b-d); walk: no restriction (a-b-d-a-b-c); closed if x = y; closed trail: circuit (a-b-c-d-b-e-d-a, one drawing without lifting the pen); closed path: cycle (a-b-c-d-a); length: the number of edges in the path/trail/walk
  • 108. Def 11.4: Let G = (V, E) be an undirected graph. We call G connected if there is a path between any two distinct vertices of G. A walk from a to b can be shortened to a path by removing any cycle at the repeated vertices. (figure: one connected graph on a-e, and a disconnected graph with two components)
  • 109. Bipartite graphs
    • A graph that can be decomposed into two partite sets but not fewer is bipartite
    • It is a complete bipartite graph if its vertices can be divided into two non-empty groups, A and B, and each vertex of A is connected to every vertex of B, and vice versa
    Complete bipartite graph K 2,3 The graph is bipartite
  • 110. Def. 11.6: multigraphs (figure: a multigraph of multiplicity 3)
  • 111. Subgraphs, Complements, and Graph Isomorphism (figure: a graph on a-e with a spanning subgraph, V1 = V, and an induced subgraph, which includes all edges of E between vertices of V1)
  • 112. Subgraphs, Complements, and Graph Isomorphism. Def. 11.11: the complete graph Kn (figure: K5 on a-e). Def. 11.12: the complement of a graph G (figure: G and its complement)
  • 113. Subgraphs, Complements, and Graph Isomorphism: graph isomorphism (figure: graphs on the vertex sets {1, 2, 3, 4}, {a, b, c, d}, and {w, x, y, z})
  • 114. Subgraphs, Complements, and Graph Isomorphism. Ex. 11.8: the correspondence a-q, c-u, e-r, g-x, i-z, b-v, d-y, f-w, h-t, j-s shows the two graphs are isomorphic. Ex. 11.9: one graph has 2 vertices of degree 2 and the other has 3, so they are not isomorphic. Can you think of an algorithm for testing isomorphism?
  • 115. Module
  • 116. Module
  • 117. Network Motif
  • 118. Graph Alignment: NetworkBLAST/PathBLAST
  • 119. Centralities
    • Degree centrality : number of direct neighbors of node v
      • where N(v) is the set of direct neighbors of node v .
    • Stress centrality : the accumulated number of shortest paths between all node pairs that pass through v
      • where ρ st (v) is the number of shortest paths passing through node v.
  • 120. Centralities
    • Closeness centrality : reciprocal of the total distance from a node v to all the other nodes in a network
      • δ (u,v) is the distance between node u and v .
    • Eccentricity : the greatest distance between v and any other vertex
  • 121. Centralities
    • Shortest path based betweenness centrality : ratio of the number of shortest paths passing through a node v out of all shortest paths between all node pairs in a network
      • σst is the number of shortest paths between nodes s and t, and σst(v) is the number of shortest paths among σst that pass through node v
    • Current flow based betweenness centrality : the amount of current that flows through v in a network
      • Random walk based betweenness centrality
  • 122. Centralities
    • Subgraph centrality : accounts for the participation of a node in all subgraphs of the network.
    • the number of closed walks of length k starting and ending at node v in the network is given by the local spectral moments μk(v).
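A sketch of the two simplest of these measures on an adjacency-list graph (degree centrality and closeness centrality via BFS distances); the degree normalization shown is one common convention, an assumption of ours rather than something fixed by the slides:

    from collections import deque

    def degree_centrality(adj, v):
        """|N(v)|, normalized by the maximum possible degree."""
        return len(adj[v]) / (len(adj) - 1)

    def closeness_centrality(adj, v):
        """Reciprocal of the total BFS distance from v to all other nodes."""
        dist = {v: 0}
        q = deque([v])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        return 1.0 / sum(dist.values())

    adj = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b', 'd'], 'd': ['c']}
    print(degree_centrality(adj, 'c'), closeness_centrality(adj, 'c'))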
  • 123. Weighted Bipartite Matching
  • 124. Weighted Bipartite Matching Given a weighted bipartite graph, find a matching with maximum total weight. Not necessarily a maximum size matching. A B
  • 125. History
    • Example of the assignment problem
    • Say you have three workers: Jim, Steve & Allan. You need one of them to clean the bathroom, another to sweep the floors & the third to wash the windows. What's the best (minimum-cost) way to assign the jobs?
  • 126. Hungarian algorithm (Augmenting Path Algorithm)
    • Orient the edges (edges in M go up, the others go down)
    • Edges in M get positive weights, the others negative weights
    Find a shortest M-augmenting path at each step
  • 127. Example
    • A company assigns 5 types of jobs to 5 persons (Alice, Bob, Chris, Dirk, Emma). Each person has a different ability for each job. The profit of assigning each person to each specific job is shown below (this table serves as the cost matrix).

               Job 1   Job 2   Job 3   Job 4   Job 5
        Alice  1$      2$      3$      4$      5$
        Bob    6$      7$      8$      7$      2$
        Chris  1$      3$      4$      4$      5$
        Dirk   3$      6$      2$      8$      7$
        Emma   4$      1$      3$      5$      4$
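For checking the worked example that follows, SciPy ships a Hungarian-method solver, scipy.optimize.linear_sum_assignment; a sketch using the table above as a cost matrix to minimize (whether the slides minimize cost or maximize profit is ambiguous, so this is an assumption):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Rows: Alice, Bob, Chris, Dirk, Emma; columns: Job 1..5 (from the slide).
    cost = np.array([[1, 2, 3, 4, 5],
                     [6, 7, 8, 7, 2],
                     [1, 3, 4, 4, 5],
                     [3, 6, 2, 8, 7],
                     [4, 1, 3, 5, 4]])
    rows, cols = linear_sum_assignment(cost)       # minimum-cost assignment
    print(list(zip(rows, cols)), cost[rows, cols].sum())
    # For maximum total profit instead, pass maximize=True.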
  • 128. Example
    • Step 0 : Initialization. Let
    • Form an excess matrix (using )
    Cost matrix Excess matrix
  • 129. Example
    • Step 1 : Construct equality subgraph
    Excess matrix
  • 130. Example
    • Step 2 Maximum Matching in subgraph
    • Find a maximum matching in it. If is a perfect matching, stop and report as a maximum weight matching and as a minimum cost cover.
  • 131. Example
    • Step 2 (continued)
    • Choose Job 3, Job 4 and Job 5 as a vertex cover with size equal to
  • 132. Example Excess matrix
    • Step 3 Dual Change
    • is not a cover of
    • Find , using
    is an edge of … not covered by …
  • 133. Example
    • Step 3 (continued)
    • Update , and excess matrix, using
    Cost matrix Excess matrix
  • 134. Example
    • Step 1 : Construct equality subgraph
    Excess matrix
  • 135. Example
    • Step 2 Maximum Matching in subgraph
  • 136. Example
    • Step 2 (continued)
    • Choose Bob, Job 1, Job 4 and Job 5 as a vertex cover with size equal to
  • 137. Example Excess matrix
    • Step 3 Dual Change
    • is not a cover of
    • Find , using
    is an edge of … not covered by …
  • 138. Set Cover
    • Definition of set cover problem
      • Given a set of elements B and subsets S1, S2, …, Sn of B (i.e., each Si ⊆ B)
      • Find a selection of subsets such that the union of picked sets is exactly B
      • Cost of selection is defined as the number of picked sets
  • 139. Set Cover
    • A greedy solution is extremely natural and intuitive for the set cover problem
      • Pick the subset with the largest number of uncovered elements
      • until all elements of B are covered
    • Can such a greedy strategy find an optimal solution (a selection with minimized cost)? See the sketch below.
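A minimal sketch of that greedy rule (data values are our own toy example):

    def greedy_set_cover(universe, subsets):
        """Repeatedly pick the subset covering the most uncovered elements."""
        uncovered = set(universe)
        chosen = []
        while uncovered:
            best = max(subsets, key=lambda s: len(s & uncovered))
            if not best & uncovered:
                break                    # remaining elements cannot be covered
            chosen.append(best)
            uncovered -= best
        return chosen

    subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
    print(greedy_set_cover({1, 2, 3, 4, 5}, subsets))   # [{1, 2, 3}, {4, 5}]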
  • 140. Set Cover
    • Example
      • The dots in the figure represent towns in a country, and the edges are paths between towns
      • Now we're planning to build schools
      • Students should be able to reach a school within one move along a path
      • So, what is the minimum number of schools that need to be built in the towns?
  • 141. Set Cover
    • Example (cont.)
      • Our greedy solution would select town a first (since it covers six neighbors: b, d, e, h, i, k)
      • Then the uncovered towns f, c, j are chosen one by one
      • In total, four schools are built, in towns a, c, f, and j
    Optimal?
  • 142. Set Cover
    • Example
      • There exists a solution with just three schools, at b, e, and i
      • The greedy solution is not optimal!
  • 143. Set Cover
    • Did greedy fail?
      • In fact, our greedy algorithm has found an approximation
      • It can be shown that the greedy algorithm uses at most k·ln(n) sets when the optimal solution picks k sets for an n-element set-cover instance
      • The approximation factor of the greedy algorithm is k·ln(n) / k = ln(n), which means we are not too far from the optimal
  • 144. Karger's Min-Cut Algorithm (figure: an example graph on vertices A, B, C, D, E, F, G)
  • 145. (figure: pick a random edge to contract)
  • 146. (figure: after the contraction, F and G merge into a single vertex FG)
  • 147. (figure: contract another random edge)
  • 148. (figure: D merges in as well, giving the vertex FGD)
  • 150. Is output min-cut?
    • Not necessarily.
    • Is it a cut?
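A compact sketch of the contraction algorithm on an edge-list multigraph (one run contracts random edges until two super-vertices remain; repeating the run many times makes the minimum cut likely to survive at least once; the example graph is ours, not the one in the figures):

    import random

    def karger_cut(edges, vertices):
        """Contract random edges until two super-vertices remain;
        return the number of edges crossing the resulting cut."""
        parent = {v: v for v in vertices}

        def find(v):                      # union-find with path compression
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        groups = len(list(vertices))
        pool = edges[:]
        random.shuffle(pool)              # contract edges in a random order
        for u, v in pool:
            if groups == 2:
                break
            ru, rv = find(u), find(v)
            if ru != rv:                  # contract edge (u, v)
                parent[ru] = rv
                groups -= 1
        return sum(1 for u, v in edges if find(u) != find(v))

    edges = [('A','B'), ('A','C'), ('B','C'), ('C','D'), ('D','E'),
             ('E','F'), ('E','G'), ('F','G')]
    print(min(karger_cut(edges, 'ABCDEFG') for _ in range(50)))   # likely 1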
  • 152. References
    • Internet
    • Prof. 唐傳義: Introduction to Systems Biology
    • Prof. 蔡明哲: Graph Theory
    • Our formidable senior labmate, 劉至善
    • Graph Theory with Applications, by J.A. Bondy and U.S.R. Murty
    • Graph Theory, by Reinhard Diestel
  • 153. Outline
    • Basic concepts of algorithm analysis
    • Basic concepts of computation theory
    • Common classes of algorithms
    • Biology-related problems and applications:
      • Sequence alignment
      • Biological networks and graph theory
      • Artificial intelligence and machine learning
      • Advanced data structures
  • 154. Artificial intelligence and machine learning
    • Algorithms vs. machine learning
    • Manual ("worker") intelligence vs. artificial intelligence
    Common tools: Decision Tree, SVM, Neural Networks, Random Forest
  • 155.
    • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
    • Traditional Techniques may be unsuitable due to
      • Enormity of data
      • High dimensionality of data
      • Heterogeneous, distributed nature of data
    Origins of Data Mining (figure): data mining sits at the intersection of Machine Learning / Pattern Recognition, Statistics / AI, and Database systems
  • 156. Illustrating Classification Task
  • 157. Decision Tree
  • 158. Example of a Decision Tree (figure): from the training data (splitting attributes: Refund - categorical, MarSt - categorical, TaxInc - continuous; class: cheat), the model tree splits on Refund (Yes -> NO; No -> MarSt), then MarSt (Married -> NO; Single, Divorced -> TaxInc), then TaxInc (< 80K -> NO; > 80K -> YES)
  • 159. Another Example of Decision Tree (figure): a tree that splits on MarSt first, then Refund, then TaxInc, fits the same training data. There could be more than one tree that fits the same data!
  • 160. Decision Tree Classification Task (figure: induce a decision tree model from training data, then apply it to test data)
  • 161-166. Apply Model to Test Data (figures): starting from the root of the tree, route the test record (Refund = No, MarSt = Married, TaxInc = 80K) down the splits; Refund = No leads to MarSt, and MarSt = Married leads to the leaf NO, so assign Cheat to "No"
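The same example can be reproduced with scikit-learn (the slides name no library, so this is our assumption; the encoded records follow the style of this textbook example and the routing above):

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical training records in the spirit of the slides:
    # features = (Refund: 1=Yes/0=No, Married: 1/0, taxable income in K)
    X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 0, 95],
         [0, 1, 60],  [1, 0, 220], [0, 0, 85], [0, 1, 75],  [0, 0, 90]]
    y = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']

    clf = DecisionTreeClassifier().fit(X, y)
    # Test record from the slides: Refund = No, Married, 80K
    print(clf.predict([[0, 1, 80]]))   # expected: ['No']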
  • 167. Support Vector Machine
  • 168. Support Vector Machines (figure): many hyperplanes separate the two classes (y = 1 and y = -1); which hyperplane should we choose?
  • 169. Support Vector Machines (figure): the margin of a separating hyperplane is |d+| + |d-|, the distances to the closest points of the two classes
  • 170. Support Vector Machines (figure): choose the hyperplane with maximum margin; the closest points that determine it are the support vectors
  • 171. Support Vector Machines (figure): the margin d between the boundary hyperplanes b_i1 and b_i2 through the points x1 and x2 (see Eqs. 5.32-5.34, p. 261 of the textbook)
  • 172. Support Vector Machines: objective function (figure)
    • The learning task in SVM can be formalized as the following constrained optimization problem: minimize ||w||^2 / 2 subject to y_i (w · x_i + b) ≥ 1 for every training example (x_i, y_i)
    (Definition 5.1, p. 262 of the textbook)
  • 173. Artificial Neural Network
  • 174. Perceptron (1)
    • Single neuron model (linear threshold unit)
      • Input: a linear combination  W i X i
      • Output: threshold function
    (figure: inputs x1, …, xn with weights w1, …, wn and a bias input x0 = 1 with weight w0, feeding a summation unit Σ)
  • 175. Perceptron (2)
    • Multiple real-valued inputs: (x1, x2, x3, ..., xn)
    • Single output (labeled +1/-1): o(x1, x2, x3, ..., xn)
    • Weights (real-valued constants): (w0, w1, w2, w3, ..., wn)
      • Real-valued constants to be determined and fit in the learning problem (i.e., the space H of candidate hypotheses is the set of all possible real-valued weight vectors)
      • For the perceptron to output +1, the weighted combination w1x1 + … + wnxn must surpass (-w0)
    • Input-output relationship: o(x1, …, xn) = sgn(w0 + w1x1 + … + wnxn)
    • In vector form, with x0 = 1: o(x) = sgn(w · x)
        • where sgn( ) is 1 if the argument is positive, -1 otherwise
  • 176. Decision Surface of a Perceptron (1)
    • Represents some useful functions
      • For example, Boolean functions
        • Both inputs and output are Boolean values
        • Assume Boolean values of +1 (true) and –1 (false)
      • What weights represent AND(x1, x2)?
        • w0 = -0.8, w1 = w2 = 0.5
        • o(x1, x2) = sgn(-0.8 + 0.5x1 + 0.5x2)
  • 177. Decision Surface of a Perceptron (2)
    • Similarly
      • OR(x1, x2)
        • w0 = 0.3, w1 = w2 = 0.5
        • o(x1, x2) = sgn(0.3 + 0.5x1 + 0.5x2)
      • NOT(x1):
        • w0 = 0.0, w1 = -1.0
        • o(x1) = sgn(0.0 - 1.0x1)
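A sketch of these Boolean examples in code, using the sgn form and the slide's weights (inputs encoded as +1 for true, -1 for false):

    def perceptron(weights, xs):
        """o(x) = sgn(w0 + w1*x1 + ... + wn*xn), inputs in {+1, -1}."""
        s = weights[0] + sum(w * x for w, x in zip(weights[1:], xs))
        return 1 if s > 0 else -1

    AND = [-0.8, 0.5, 0.5]          # weights from slide 176
    OR  = [ 0.3, 0.5, 0.5]          # weights from slide 177
    for x1 in (+1, -1):
        for x2 in (+1, -1):
            print(x1, x2, perceptron(AND, [x1, x2]), perceptron(OR, [x1, x2]))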
  • 178. Sigmoid Unit (figure: the same structure as the perceptron, with inputs x1, …, xn, weights w0, …, wn and x0 = 1, but a sigmoid instead of a threshold at the output)
  • 179. Multilayer Networks (1)
    • Much greater representational power
    • Can find nonlinear decision surfaces
    • A multilayer network is made up of many simple interconnected units
      • Feedforward networks are acyclic, directed graphs
      • the output of each unit is passed to the inputs of successive units
    (figure: a feedforward network with input layer x1-x3, hidden layer h1-h4, and output layer o1, o2; weights such as w11 and w43 label the connections)
  • 180. References
    • Internet
    • CS department courses: Statistical Learning Theory, Data Mining, Artificial Intelligence, Pattern Recognition
    • EE department course: Pattern Recognition
    • Senior labmates 佳揚 and 筌敬
    • Machine Learning, Tom Mitchell, McGraw Hill, 1997.
    • R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd ed., John Wiley, 2001.
  • 181. Outline
    • Basic concepts of algorithm analysis
    • Basic concepts of computation theory
    • Common classes of algorithms
    • Biology-related problems and applications:
      • Sequence alignment
      • Biological networks and graph theory
      • Artificial intelligence and machine learning
      • Advanced data structures
  • 182. Advanced Data Structure
    • Suffix Tree
    • Bloom filter
    • Randomized Search Trees
    • Priority Search Trees.
  • 183. Indexing
    • Using a sparse representation, a database can be preprocessed in linear time to allow locating all instances of a short string.
    • Major limitation: search is restricted to fixed-length strings.
  • 184. Suffix Trees (figure): for S = MALAYALAM$ (positions 1-10), paths from the root to the leaves represent all suffixes of S; the leaves are labeled with starting positions 1-10
  • 185. Suffix Tree (figure: the same tree for MALAYALAM$)
  • 186. Suffix tree properties
    • For a string S of length n , there are n+1 leaves and at most n internal nodes.
      • therefore requires only linear space,
      • provided edge labels are O(1) space
    • Each leaf represents a unique suffix.
    • Concatenation of edge labels from root to a leaf spells out the suffix.
    • Each internal node represents a distinct common prefix to at least two suffixes.
  • 187. Application: Finding a short Pattern in a long String
    • Build a suffix tree of the string.
    • Starting from the root, traverse a path matching characters of the pattern.
    • If stuck, pattern not present in string.
    • Otherwise, each leaf below gives a position of the pattern in the string.
  • 188. Finding a Pattern in a String (figure): find "ALA" by walking down the suffix tree of MALAYALAM$; two matches, at positions 6 and 2
  • 189. Edge Encoding (figure): for S = MALAYALAM$ (positions 1-10), each edge label is stored as a pair (start, end) of positions into S, e.g., (5, 10) for YALAM$, keeping every edge label O(1) space
  • 190. Naïve Suffix Tree Construction. Before starting: why exactly do we need this $, which is not part of the alphabet? (figure: the ten suffixes of MALAYALAM$, from $ at position 10 up to MALAYALAM$ at position 1)
  • 191. Naïve Suffix Tree Construction (figure: insert the suffixes one at a time, splitting shared prefixes such as A into internal nodes, etc.)
  • 192. Is the Suffix Tree good?
    • Yes, because of the optimal search time
    • No, because of the space requirement…
      • The space can be much larger than the text
      • E.g., Text = the DNA of a human
      • To store the text, we need 0.8 Gbyte
      • To store the suffix tree, we need 64 Gbyte!
  • 193. Something Wrong??
    • Both the suffix tree and the text have n things, so they both need O(n) space…
    • How come there is such a big difference??
      • Let us do a more careful analysis
    • Let A be the alphabet (i.e., the set of distinct characters) of a text T
      • E.g., for DNA, A = {a,c,g,t}
  • 194. Something Wrong?? (2)
    • To store T, we need only n log |A| bits
    • But to store the suffix tree, we will need n log n bits
    • When n is very large compared to |A| , there is a huge difference
    • Question: Is there an index that supports fast searching, but occupies O( n log |A| ) bits only??
  • 195. Suffix Array – Reducing Space (figure): for S = MALAYALAM$ (positions 1-10), the suffix array is the lexicographic ordering of the suffixes: 6 (ALAM$), 2 (ALAYALAM$), 8 (AM$), 4 (AYALAM$), 7 (LAM$), 3 (LAYALAM$), 9 (M$), 1 (MALAYALAM$), 5 (YALAM$), 10 ($). From it we derive the longest-common-prefix (lcp) array, 3, 1, 1, 0, 2, 0, 1, 0, 0, achieved for successive pairs: suffixes 6 and 2 share "ALA" (lcp 3), while suffixes 2 and 8 share just "A" (lcp 1)
  • 196. Example (figure): the text M A L A Y A L A M $ with positions 1-10, its suffix array, and the lcp array
  • 197. Pattern Search in Suffix Array
    • All suffixes that share a common prefix appear in consecutive positions in the array.
    • Pattern P can be located in the string using a binary search on the suffix array.
    • Naïve run-time = O(|P| · log n).
    • Improved to O (|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
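A sketch of the naive O(|P| · log n) binary search on a suffix array (the array is built here by plain sorting, which is far from the linear-time constructions cited on the next slide; materializing the suffixes is for clarity, not economy):

    from bisect import bisect_left

    def suffix_array(s):
        """1-based starting positions of suffixes in lexicographic order."""
        return sorted(range(1, len(s) + 1), key=lambda i: s[i-1:])

    def find_all(s, p):
        """All positions where p occurs in s, via binary search on the SA."""
        sa = suffix_array(s)
        suffixes = [s[i-1:] for i in sa]
        lo = bisect_left(suffixes, p)            # first suffix >= p
        hi = lo
        while hi < len(sa) and suffixes[hi].startswith(p):
            hi += 1                              # consecutive block sharing p
        return sorted(sa[lo:hi])

    print(find_all("MALAYALAM$", "ALA"))   # [2, 6], as in the suffix-tree slide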
  • 198. Known (amazing) Results
    • A suffix tree can be constructed in O(n) time and O(n · |Σ|) space [Weiner73, McCreight76, Ukkonen92].
    • Suffix arrays can be constructed without using suffix trees in O(n) time [Pang&Aluru03].
  • 199. More Applications
    • Suffix-prefix overlaps in fragment assembly
    • Maximal and tandem repeats
    • Shortest unique substrings
    • Maximal unique matches [MUMmer]
    • Approximate matching
    • Phylogenies based on complete genomes
  • 200. Approximate set membership problem
    • Suppose we have a set
    • S = {s1, s2, ..., sm} ⊆ a universe U
    • Represent S in such a way that we can quickly answer "Is x an element of S?"
    • To take as little space as possible, we allow false positives (i.e., x ∉ S, but we answer yes)
    • If x ∈ S, we must answer yes.
  • 201. Bloom filters
    • Consists of an array A[n] of n bits (the space), and k independent random hash functions
    • h1, …, hk : U -> {0, 1, .., n-1}
    • 1. Initially set the array to 0
    • 2. For each s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k
    • (an entry can be set to 1 multiple times; only the first time has an effect)
    • 3. To check if x ∈ S, check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1
    • If not, clearly x ∉ S.
    • If all A[hi(x)] are set to 1, we assume x ∈ S
  • 202. (figure): start with all bits 0; each element of S is hashed k times and each hash location is set to 1. To check whether y is in S, check the k hash locations: if a 0 appears, y is not in S; if only 1s appear, conclude that y is in S — this may yield a false positive
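A minimal sketch of the structure (double hashing via Python's built-in hash() with a salt stands in for the k independent random hash functions, an implementation shortcut of ours):

    class BloomFilter:
        def __init__(self, n_bits, k):
            self.n, self.k = n_bits, k
            self.bits = [0] * n_bits

        def _locations(self, item):
            # k pseudo-independent locations derived from two hash values
            h1, h2 = hash(item), hash(item + "salt")
            return [(h1 + i * h2) % self.n for i in range(self.k)]

        def add(self, item):
            for loc in self._locations(item):
                self.bits[loc] = 1

        def __contains__(self, item):
            return all(self.bits[loc] for loc in self._locations(item))

    bf = BloomFilter(n_bits=64, k=5)
    bf.add("ACGT")
    print("ACGT" in bf, "TTTT" in bf)   # True, (almost surely) False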
  • 203. The probability of a false positive
    • We assume the hash functions are random.
    • After all the elements of S are hashed into the Bloom filter, the probability that a specific bit is still 0 is (1 - 1/n)^{km} ≈ e^{-km/n}
  • 204.
    • To simplify the analysis, we can assume a fraction p = e^{-km/n} of the entries are still 0 after all the elements of S are hashed into the Bloom filter.
    • In fact, let X be the random variable counting those 0 positions. By a Chernoff bound,
    • X/n will be very close to p with very high probability
  • 205.
    • The probability of a false positive is f = (1 - p)^k ≈ (1 - e^{-km/n})^k
    • To find the optimal k that minimizes f:
    • minimizing f is equivalent to minimizing g = ln(f)
    • k = ln(2) · (n/m)
    • f = (1/2)^k ≈ (0.6185..)^{n/m}
    • The false-positive probability falls exponentially in n/m, the number of bits used per item!!
  • 206.
    • A Bloom filter is like a hash table that simply uses one bit to keep track of whether an item hashed to the location.
    • If k = 1, it is equivalent to a hashing-based fingerprint system.
    • If n = cm for a small constant c, such as c = 8, then k = 5 or 6, and the false-positive probability is just over 2%.
    • It is interesting that when k is optimal,
    • k = ln(2) · (n/m), then p = 1/2.
    • An optimized Bloom filter looks like a random bit string
  • 211. Deterministic Tools
    • AVL Tree
    • Red-Black Tree
    • Fib. Heap
    • Splay Tree
    • Soft Heap
      • NOT EASY TO IMPLEMENT
  • 212. Range Searching
    • S = set of geometric objects
    • Q = query object
    • Report/Count objects in S that intersect Q
    (figure: a query object Q over the set of geometric objects; report/count the answers)
  • 213. Single-shot vs. Repetitive
    • A query may be:
    • Single-shot (one-time): no need to preprocess
    • Repetitive mode: many queries are expected; preprocess S into a data structure so that queries can be answered fast
  • 214. Orthogonal Range Searching in 1D
    • S: Set of points on real line.
    • Q= Query Interval [a,b]
    (figure) Which points lie inside the interval [a, b]?
  • 215. Orthogonal Range Searching in 2D
    • S = Set of points in the plane
    • Q = Query Rectangle
  • 216.
    • Build a balanced search tree where all data points are stored in the leaves.
    1D Range Query (figure: a balanced tree over the points 2, 4, 5, 7, 8, 12, 15, 19; a query such as [6, 17] reports the leaves between the two search paths). Query: O(log n + k), space: O(n)
  • 217. Querying Strategy
    • Given the interval [a, b], search for a and b
    • Find where the two paths split, and look at the subtrees in between
    (figure: the search paths to a and b split at one node) Problem: linking the leaves does not extend to higher dimensions. Idea: if parents knew all their descendants, we wouldn't need to link the leaves.
  • 218. Efficiency
    • Preprocessing Time: O(n log n)
    • Space: O(n)
    • Query Time: O(log n + k)
    • k = number of points reported
    • Output-sensitive query time
    • Binary search tree can be kept balanced in O(log n) time per update in dynamic case
  • 219. 1D Range Counting
    • S = Set of points on real line
    • Q= Query Interval [a,b]
    • Count points in [a,b]
    • Solution: At each node, store count of number of points in the subtree rooted at the node.
    • Query: Similar to reporting but add up counts instead of reporting points.
    • Query Time: O(log n)
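The same O(log n) counting can be had, for a static point set, by binary-searching a sorted array (a stand-in for the subtree-count tree described above, which the tree version extends with O(log n) updates); a minimal sketch:

    from bisect import bisect_left, bisect_right

    def range_count(sorted_pts, a, b):
        """Number of points in [a, b] among pre-sorted points."""
        return bisect_right(sorted_pts, b) - bisect_left(sorted_pts, a)

    pts = sorted([2, 4, 5, 7, 8, 12, 15, 19])   # leaves of the slide-216 tree
    print(range_count(pts, 5, 15))              # 5 points: 5, 7, 8, 12, 15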
  • 220. 2D Range queries
    • How do you efficiently find points that are inside of a rectangle?
      • Orthogonal range query ([x1, x2], [y1, y2]): find all points (x, y) such that x1 < x < x2 and y1 < y < y2
    (figure: a query rectangle [x1, x2] × [y1, y2] in the plane)
  • 221. Range trees
      • Canonical subset P ( v ) of a node v in a BST is a set of points (leaves) stored in a subtree rooted at v
      • Range tree is a multi-level data structure:
        • The main tree is a BST T on the x -coordinate of points
        • Any node v of T stores a pointer to a BST T y ( v ) ( associated structure of v ), which stores canonical subset P ( v ) organized on the y -coordinate
        • 2D points are stored in all leaves!
    (figure: the main tree T is a BST on x-coordinates; a node v stores a pointer to its associated structure Ty(v), a BST on y-coordinates holding the canonical subset P(v))
  • 222.
    • For each internal node v ∈ Tx, let P(v) be the set of points stored in the leaves of the subtree rooted at v.
    • The set P(v) is stored with v as another balanced binary search tree Ty(v) (descendants by y) on y-coordinate (with a pointer from v to Ty(v)).
    Range trees (figure: points p1-p7 in the main tree Tx; a node v with its associated structure Ty(v) over p5, p6, p7)
  • 223.
    • The diagram below shows what is stored at one node. Show what is stored at EVERY node. Note that data is only stored at the leaves.
    Range trees (figure: the same tree as before), with the points:

        point   p1    p2   p3   p4   p5    p6    p7
        x       1     2    3    4    4.5   5.5   6.5
        y       2.5   1    0    4    3     3.5   2
  • 224. Range trees: query time. Querying a 1D tree requires O(log n + k) time. How many 1D trees (associated structures) do we need to query? At most 2 × height of T = 2 log n. Each 1D query requires O(log n + k') time, so the query time is O(log^2 n + k). The answer to the query is the union of the answers to the subqueries: k = Σ k'. (figure: a query [x, x'] on the main tree)
  • 225. Size of the range tree
    • Size of the range tree :
      • At each level of the main tree associated structures store all the data points once (with constant overhead): O ( n ).
      • There are O (log n ) levels.
      • Thus, the total size is O ( n log n ).
  • 226. Building the range tree
    • Efficient building of the range tree:
      • Sort the points on x and on y (two arrays: X , Y ).
      • Take the median v of X and create a root, build its associated structure using Y.
      • Split X into sorted XL and XR, split Y into sorted YL and YR (such that for any p ∈ XL or p ∈ YL, p.x < v.x, and for any p ∈ XR or p ∈ YR, p.x ≥ v.x).
      • Build recursively the left child from X L and Y L and the right child from X R and Y R.
    • The running time is O ( n log n ).
  • 227. Generalizing to higher dimensions
    • A d-dimensional range tree can be built recursively from (d-1)-dimensional range trees.
    • Build a binary search tree on the coordinates for dimension d.
    • Build secondary data structures with (d-1)-dimensional range trees.
    • Space: O(n log^{d-1} n).
    • Query time: O(log^d n + k).
  • 228. References
    • Internet
    • Prof. 盧錦隆: Computational Biology
    • Prof. 韓永楷: Randomized Algorithms
    • Prof. 潘雙洪: Computational Geometry